Abstract

The problem of model selection by cross-validation is addressed in the density estimation
framework. Extensively used in practice, cross-validation (CV) remains poorly understood,
especially in the non-asymptotic setting, which is the main concern of this work.

A recurrent problem with CV is the computation time it involves. This drawback is
overcome here thanks to closed-form expressions for the CV estimator of the risk for
a broad class of widespread estimators: projection estimators.

In order to shed new light on CV procedures with respect to the cardinality p of
the test set, the CV estimator is interpreted as a penalized criterion with a random
penalty. For instance, the amount of penalization is shown to increase with p.

A theoretical assessment of the CV performance is carried out thanks to two oracle
inequalities applying respectively to bounded or square-integrable densities. For several
collections of models, adaptivity results with respect to Hölder and Besov spaces
are derived as well.

Keywords: Density estimation, cross-validation, model selection, leave-p-out, random
penalty, oracle inequality, projection estimators, adaptivity in the minimax sense, Hölder,
Besov.

1 Introduction

The main concern of this paper is the analysis of cross-validation procedures when
employed to perform model selection in the density estimation context. This analysis results
in a new understanding of CV behaviour as well as in several optimality results. Before
entering into details, let us briefly describe related works in the model selection area.



1.1 Model selection 

Model selection via penalization was introduced by the seminal papers of Mallows
(1973) on $C_p$ and Akaike (1973) on AIC, and also by Schwarz (1978), who proposed the
BIC criterion. AIC and BIC have an asymptotic flavour, which makes their performance
depend on the model collection at hand as well as on the sample size (see Baraud et al.,
2009).

More recently, Birgé and Massart (1997, 2001, 2006) have developed a non-asymptotic
approach, inspired by the pioneering work of Barron and Cover (1991). It aims at
choosing a model among a countable family $\{S_m\}_{m \in \mathcal{M}_n}$, where $\mathcal{M}_n$ is allowed
to depend on the sample size n. From this point of view, an estimator $\widehat{s}_m$ is associated
with each model $S_m$, and a penalized criterion is designed and then minimized to provide
a final estimator $\widehat{s} = \widehat{s}_{\widehat{m}}$. The goal of this approach is efficiency, that is, the risk
of $\widehat{s}$ is as small as the smallest risk achievable by any of the estimators in the collection.
Actually, this cannot be reached in the non-asymptotic setting, and the quality assessment
of the procedure is made through an oracle inequality. Such an inequality instead asserts
that the risk of $\widehat{s}$ is almost the same as the smallest achievable one, up to a multiplicative
constant $C_n \geq 1$ and a remainder term. When $C_n$ converges to 1 as n tends to infinity,
the model selection procedure is said to be asymptotically efficient.

In the density estimation framework, Barron et al. (1999) developed a general approach
based on deterministic penalties, leading to an oracle inequality involving the
Kullback-Leibler divergence and the Hellinger distance. This result has been adapted to the
particular case of histograms by Castellan (1999, 2003) and further studied in Birgé and
Rozenholc (2006). With the quadratic risk, the penalties proposed by Birgé and Massart (1997)
and Barron et al. (1999) also enjoy some optimality properties when applied to projection
estimators. The resulting estimators exhibit some adaptivity in the minimax sense with
respect to Besov spaces for several appropriate functional bases (see Birgé and Massart,
1997).

1.2 Cross-validation 

Unlike the aforementioned approaches relying on deterministic penalties, the main
concern of the present work is the use of cross-validation (CV) as a model selection
procedure in the density estimation context. "Cross-validation" refers to a family of
resampling-based procedures resulting from a heuristic argument. Cross-validation procedures
were first studied in a regression context by Stone (1974, 1977) for the leave-one-out
(Loo) and Geisser (1974, 1975) for the V-fold cross-validation (VFCV), and by Rudemo
(1982) and Stone (1984) in the density estimation framework.

Since these algorithms can be computationally demanding or even intractable, Rudemo 
(1982) and Bowman (1984) provided some closed-form expressions for the Loo estimator 
of the risk of histograms or kernel estimators. These results have been recently generalized 
by Celisse and Robin (2008b) to the leave-p-out cross-validation (Lpo). 

Most theoretical results about the performance of CV procedures are asymptotic and
mainly concern the regression framework. For a fixed model, Burman (1989, 1990) expands
several CV estimators of the risk of $\widehat{s}_m$ and concludes that Loo is the best one in terms
of bias and variance. Besides, several comparisons have been made between CV and various
penalized criteria: Li (1987) and Zhang (1993) with a view to asymptotic efficiency, and Shao
(1993) and Yang (2007) on model consistency, that is, recovering the "true model". Interested
readers are referred to Shao (1997) for an extensive review of asymptotic optimality
properties, in terms of efficiency and model consistency, of some penalized criteria as well
as CV procedures.

As for non-asymptotic results in the density setting, Birgé and Massart (1997) have
established an oracle inequality that relies on a conjecture and may be applied to the Loo
procedure. However, to the best of our knowledge, no result of this type has yet
been proved for the Lpo procedure in the density estimation setup. Recently, in the
regression setting, Arlot (2007b) established oracle inequalities for V-fold penalties, while
Arlot and Celisse (2009) have carried out an extensive simulation study in the
change-point detection problem with heteroscedastic data.

1.3 Main contributions 

The present paper is devoted to the study of CV procedures as a means to perform model
selection in the density estimation framework.

A constant drawback of CV, and of resampling strategies in general, is the computation
time such procedures involve. Indeed, performing Loo with a large data set can be
computationally prohibitive. Closed-form expressions are provided for the Lpo estimator of the
$L^2$-risk of the broad class of projection estimators, which makes these results widely
applicable. These expressions drastically reduce the computation time and give more insight
into the behaviour of CV risk estimators.

The CV estimator is then embedded into the penalized criterion framework. This emphasizes
the tight relationship between the choice of p and the amount of penalization resulting
from this choice. In the model selection setting, the interest of choosing p > 1 arises as a
way to counterbalance overfitting.

Several non-asymptotic optimality results are also derived in terms of oracle inequality 
as well as adaptivity results in the minimax sense. To the best of our knowledge, these 
are the first theoretical non-asymptotic results of this type applying to Lpo in the density 
estimation setting. 

The paper is organized as follows. The next section describes the statistical framework
and notation. CV is presented as a special case of resampling procedures and some
examples of well-known CV procedures are provided. Closed-form expressions are then derived in
Section 3 with several examples. Some bias and variance calculations are also provided for
various CV risk estimators.

The main concern of Section 4 is model selection. The Lpo estimator of the risk is
interpreted as a penalized criterion with a random penalty. The amount of penalization
is quantified with respect to p, which stresses the interest of choosing p > 1 as a means to
overcome overfitting. Two oracle inequalities are then derived that guarantee the good
non-asymptotic performance of Lpo as a model selection procedure with polynomial collections
of models.

Section 5 is devoted to adaptivity results in the minimax sense with respect to Hölder
as well as Besov spaces. Different collections of models are considered, such as piecewise
and trigonometric polynomials. A discussion with some possible prospects then follows in
Section 6. Finally, proofs are collected in Section 7.

2 Leave-p-out cross-validation

Resampling-based strategies such as CV are usually time-consuming and can even be
computationally intractable. The interest of the forthcoming approach is to derive
closed-form expressions for the CV-based estimator of the risk of projection estimators, which are
widespread in the density estimation community (Rudemo (1982); Donoho et al. (1996);
Birgé and Massart (1997); Barron et al. (1999)).

First, the statistical framework is described. A definition of projection estimators is
given and illustrated by several examples. Second, the CV heuristics is detailed with an
emphasis on the relationship between CV and resampling procedures. Several well-known CV
procedures are also recalled.

2.1 Statistical framework 

Let us start with introducing the framework and some notation which are repeatedly 
used throughout the paper. 

2.1.1 Notation 

In the sequel, $X_1, \ldots, X_n \in [0,1]$ are independent and identically distributed random
variables drawn from a probability distribution $P$ of density $s \in L^2([0,1])$ with respect to
Lebesgue's measure on [0, 1].

Let $S^*$ denote the set of measurable functions on [0, 1]. The distance between $s$ and
any $u \in S^*$ is measured thanks to the quadratic loss, denoted by $\ell$ and satisfying

$$\ell : (s,u) \mapsto \ell(s,u) := \|s - u\|_2^2 .$$

Since this quantity depends on $s$, which is unknown, let us introduce the associated contrast
function

$$\gamma : (u,x) \mapsto \gamma(u,x) := \|u\|_2^2 - 2u(x) .$$

This contrast is related to the loss function by $\ell(s,u) = P\gamma(u) - P\gamma(s)$, where
$P\gamma(u) = \mathbb{E}[\gamma(u,X)]$ and $X \sim P$, for any $u \in S^*$. The empirical risk at point
$u \in S^*$, which estimates $\ell(s,u)$ up to a constant term, is defined by

$$\gamma_n(u) := P_n\gamma(u) = \frac{1}{n} \sum_{i=1}^{n} \gamma(u, X_i) ,$$

where $P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ denotes the empirical measure. The quality assessment of an
estimator $\widehat{s} = \widehat{s}(X_1, \ldots, X_n)$ of $s$ is made through the corresponding quadratic risk

$$R_n(\widehat{s}) := \mathbb{E}\left[\ell(s,\widehat{s})\right] = \mathbb{E}\left[\|s - \widehat{s}\|_2^2\right] .$$

Let $\mathcal{M}_n$ denote a countable set of indices. For every $m \in \mathcal{M}_n$, $S_m$ is a set of candidate
functions to estimate $s$, called a model in the following. Since every model $S_m$ is uniquely
determined by its index, m is also called a model.

In every model $S_m$, $\widehat{s}_m$ denotes an estimator of $s$ defined as the empirical risk minimizer
over $S_m$:

$$\widehat{s}_m := \mathrm{Argmin}_{u \in S_m} P_n\gamma(u) .$$

The resulting collection of estimators $\{\widehat{s}_m\}_{m \in \mathcal{M}_n}$ corresponds to the collection of models
$\{S_m\}_{m \in \mathcal{M}_n}$.




2.1.2 Projection estimators 

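Let $\Lambda_n$ be a countable set of indices and $\{\varphi_\lambda\}_{\lambda \in \Lambda_n}$ a family of vectors in $L^2([0,1])$ such
that for every $m \in \mathcal{M}_n$, there exists $\Lambda(m) \subset \Lambda_n$ with $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$ an orthonormal
family of $L^2([0,1])$. Then, let $S_m$ denote the linear space spanned by $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$ and
$D_m = \dim(S_m)$ for every m.

The orthogonal projection of $s$ onto $S_m$ is denoted by $s_m$:

$$s_m := \mathrm{Argmin}_{u \in S_m} P\gamma(u) = \sum_{\lambda \in \Lambda(m)} P\varphi_\lambda \, \varphi_\lambda , \quad \text{with} \quad P\varphi_\lambda = \mathbb{E}[\varphi_\lambda(X)] .$$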

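Definition 2.1. An estimator $\widehat{s} \in L^2([0,1])$ is a projection estimator if there exists a
family $\{\varphi_\lambda\}_{\lambda \in \Lambda}$ of orthonormal vectors of $L^2([0,1])$ such that

$$\widehat{s} = \sum_{\lambda \in \Lambda} \alpha_\lambda \varphi_\lambda , \quad \text{with} \quad \alpha_\lambda = \frac{1}{n} \sum_{i=1}^{n} H_\lambda(X_i) ,$$

where $\{H_\lambda(\cdot)\}_{\lambda \in \Lambda}$ depends on the family $\{\varphi_\lambda\}_{\lambda \in \Lambda}$.

Therefore, for every $m \in \mathcal{M}_n$, the empirical risk minimizer over $S_m$ is a
projection estimator since

$$\widehat{s}_m = \sum_{\lambda \in \Lambda(m)} P_n\varphi_\lambda \, \varphi_\lambda , \quad \text{with} \quad P_n\varphi_\lambda = \frac{1}{n} \sum_{i=1}^{n} \varphi_\lambda(X_i) .$$

Examples of projection estimators (see DeVore and Lorentz, 1993)

• Histograms:

For every $m \in \mathcal{M}_n$, let $\{I_\lambda\}_{\lambda \in \Lambda(m)}$ be a partition of [0, 1] into $\mathrm{Card}(\Lambda(m)) = D_m$
intervals. Set $\varphi_\lambda = \mathbb{1}_{I_\lambda} / \sqrt{|I_\lambda|}$ for every $\lambda \in \Lambda(m)$, where $|I_\lambda|$ denotes the Lebesgue
measure of $I_\lambda$. Then, the empirical risk minimizer, which is a histogram, is a
projection estimator:

$$\widehat{s}_m = \sum_{\lambda \in \Lambda(m)} P_n\mathbb{1}_{I_\lambda} \, \frac{\mathbb{1}_{I_\lambda}}{|I_\lambda|} .$$

• Trigonometric polynomials:

Let $\{\varphi_\lambda\}_{\lambda \in \mathbb{Z}}$ be the orthonormal basis of $L^2([0,1])$ such that $t \mapsto \varphi_\lambda(t) = e^{2\pi i \lambda t}$. For
any finite $\Lambda(m) \subset \mathbb{Z}$, the trigonometric polynomial

$$t \mapsto \widehat{s}_m(t) = \sum_{\lambda \in \Lambda(m)} P_n\varphi_\lambda \, e^{2\pi i \lambda t}$$

is a projection estimator.

• Wavelet basis:

Let $\{\varphi_\lambda\}_{\lambda \in \Lambda_n}$ be an orthonormal basis of $L^2([0,1])$ made of compactly supported wavelets,
where $\Lambda_n = \{(j,k) \mid j \in \mathbb{N}^* \text{ and } 1 \le k \le 2^j\}$. For every subset $\Lambda(m)$ of $\Lambda_n$, the
empirical risk minimizer associated with $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$ is the projection estimator

$$\widehat{s}_m = \sum_{\lambda \in \Lambda(m)} P_n\varphi_\lambda \, \varphi_\lambda .$$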
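As a purely illustrative sketch (it is not part of the original derivations), the histogram
example can be computed in a few lines of Python with NumPy. The function name
`histogram_projection_estimator`, the toy sample and the regular partition are assumptions
made for the example only.

```python
import numpy as np


def histogram_projection_estimator(x, edges):
    """Coefficients and heights of the histogram estimator on the partition `edges`.

    For a cell I of length |I| containing n_I of the n points, the coefficient on
    phi_I = 1_I / sqrt(|I|) is P_n(phi_I) = n_I / (n * sqrt(|I|)), so the estimator
    takes the value n_I / (n * |I|) on I.
    """
    x = np.asarray(x, dtype=float)
    edges = np.asarray(edges, dtype=float)
    counts, _ = np.histogram(x, bins=edges)          # n_I for every cell
    lengths = np.diff(edges)                         # |I| for every cell
    coeffs = counts / (x.size * np.sqrt(lengths))    # P_n(phi_I)
    heights = coeffs / np.sqrt(lengths)              # value of the estimator on I
    return coeffs, heights


# Hypothetical sample in [0, 1] and regular partition into D_m = 4 cells.
sample = np.array([0.05, 0.10, 0.30, 0.55, 0.60, 0.62, 0.90, 0.95])
partition = np.linspace(0.0, 1.0, 5)
coeffs, heights = histogram_projection_estimator(sample, partition)
```

The heights integrate to one over [0, 1], as expected for a density estimator built from
a partition that covers the whole sample.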

2.2 Cross-validation 

First, CV is presented as a particular instance of subsampling, which yields a
unified description of CV procedures. Then, several CV procedures are detailed with an
emphasis on leave-p-out cross-validation (Lpo), which is further studied in the remainder
of the paper.

2.2.1 Resampling 

A resampling procedure consists in generating new sets of observations — the resamples — 
from the original sample according to a given scheme. Resampling corresponds to sub- 
sampling when the resample cardinality is less than that of the original sample. 

Among the first resampling procedures, a primitive version of CV was performed
by Larson (1931) in the early 1930s, while the jackknife was introduced by Quenouille (1949)
and also studied by Tukey (1958). However, resampling procedures only emerged
as a worthwhile matter of study following the work of Efron (1979, 1982) on the bootstrap.
Interested readers are referred to (Arlot, 2007a, Introduction) and (Celisse, 2008,
Introduction) for a general point of view on resampling, and also to Giné (1997) for some more
references.

Following the heuristics described by Efron (1979), resampling approximates the
unknown distribution of a statistic by that of "resampled statistics", that is, statistics
computed from the resamples, given the original data (see Mason and Newton, 1992, for examples
of theoretical results).

Let $X_{1,n} = \{X_1, \ldots, X_n\}$ denote the original observations and $X^*_{1,N} = \{X^*_1, \ldots, X^*_N\}$, for
$N \le n$, some resampled data. Then the empirical distribution of $X^*_{1,N}$ is equal to

$$\frac{1}{N} \sum_{i=1}^{N} \delta_{X^*_i} = \frac{1}{n} \sum_{i=1}^{n} W_{n,i} \, \delta_{X_i} ,$$

where $W_n = (W_{n,1}, \ldots, W_{n,n})$ denotes a weight vector, which is specific to the resampling
scheme. The weights $W_{n,i}$ are random variables drawn independently from the $X_i$'s according to
a known distribution.



2.2.2 Cross-validation rationale 

Unlike the bootstrap, CV does not aim at recovering the distribution of a given statistic.
More precisely, CV is devoted to estimating the risk of an estimator $\widehat{s}(P_n)$ of $s$. The notation
$\widehat{s}(P_n)$ stresses the dependence of $\widehat{s}$ on the original observations through the empirical measure.
The risk of $\widehat{s}$ can be expressed as

$$r_n(\widehat{s}) = \mathbb{E}_{X_{1,n}}\left[ \mathbb{E}_X\left[ \gamma(\widehat{s}(P_n), X) \right] \right] , \qquad (1)$$

where $X$ denotes a new observation, independent from $X_{1,n}$ and identically distributed.
$\mathbb{E}_X$ and $\mathbb{E}_{X_{1,n}}$ are expectations with respect to $X$ and $X_{1,n}$ respectively.

The crux of the CV heuristics is the independence between $X$ and $X_{1,n}$, which arises
from (1). This point is at the core of any CV procedure and justifies the splitting of $X_{1,n}$
into a training set, used to compute the estimator, and a test set, used to assess the
quality of the latter estimator.

Since the training set plays the same role as the initial sample but with cardinality
less than n, it acts as a subsample of $X_{1,n}$. Therefore, this subsampling scheme can be
defined by the choice of some random weights $W_n = (W_{n,1}, \ldots, W_{n,n})$. Details about CV
procedures and corresponding weights are provided in Section 2.2.3.

Let $P_n^{T}$ and $P_n^{V}$ respectively denote the empirical measures of the data in the training set
and in the test set. For a given split of the data, the CV estimate satisfies

$$P_n^{V}\gamma\left(\widehat{s}(P_n^{T})\right) \approx \mathbb{E}_X\left[\gamma(\widehat{s}(P_n), X)\right] .$$

The left-hand side quantity depends on the realization of the random weights, which can
be removed by integrating with respect to them:

$$\widehat{R}_{CV,W} := \mathbb{E}_W\left[ P_n^{V}\gamma\left(\widehat{s}(P_n^{T})\right) \right] \approx \mathbb{E}_{X_{1,n}}\left[ \mathbb{E}_X\left[\gamma(\widehat{s}(P_n), X)\right] \right] = r_n(\widehat{s}) ,$$

where $\mathbb{E}_W$ means that integration is carried out with respect to the weights.

$\widehat{R}_{CV,W}$ denotes the CV estimate of the risk of $\widehat{s}$, up to a constant. The notation points
out that it depends on the choice of the weight distribution (see Section 2.2.3).

In the sequel, $r_n(\widehat{s})$ is repeatedly used and referred to as the risk of $\widehat{s}$.

2.2.3 Cross-validation botany 

Several CV procedures are described with their associated weights. A distinction is made
between time-consuming and computationally efficient ones. VFCV and Lpo are then
compared with one another in several respects.

In the following, for any $1 \le p \le n-1$, $\mathcal{E}_p$ denotes the set of all possible subsets of
$\{1, \ldots, n\}$ with cardinality p.

Hold-out From a historical point of view, simple validation, also called Hold-out (Ho),
was introduced in the early 1930s. For instance, it is employed by Larson (1931) in
his empirical analysis. Hold-out simply consists in randomly splitting observations into a
training set of cardinality $n-p$ and a test set of cardinality p, with $1 \le p \le n-1$. Data
splitting is only made once, which results in additional variability.

Since it is easy to analyse, hold-out has often been studied: see for instance
Bartlett et al. (2002); Blanchard and Massart (2006) in classification, and
Lugosi and Nobel (1999); Wegkamp (2003) in regression.

For any random choice of $e \in \mathcal{E}_p$, the hold-out estimator of $r_n(\widehat{s})$ is

$$\widehat{R}_{Ho,p}(\widehat{s}) := P_n^{e}\gamma\left(\widehat{s}(P_n^{\bar e})\right) = \frac{1}{p} \sum_{i \in e} \gamma\left(\widehat{s}(X^{\bar e}_{1,n}), X_i\right) ,$$

where $P_n^{e}$ (resp. $P_n^{\bar e}$) denotes the empirical distribution of data in the test set (resp. in
the training set), and $X^{\bar e}_{1,n}$ denotes the observations of the training set. Hold-out corresponds
to the random choice of $W_n$ such that for every i, $W_{n,i} \in \{0, n/p\}$, $\sum_{i=1}^{n} W_{n,i} = n$, and $W_n$ is
drawn from a Dirac measure over the $\binom{n}{p}$ such vectors.

Leave-one-out Leave-one-out (Loo) was the first CV procedure, since strictly speaking
CV starts when simple validation is carried out for several splits of the data. It was
first formalized by Mosteller and Tukey (1968), and then studied in the model selection
framework by Stone (1974). It consists in successively removing each observation from the
original data, using the $n-1$ remaining ones to compute the estimator. The performance of
the latter estimator is then assessed thanks to the removed point. The Loo risk estimator
is defined as the average performance assessment over the n possible splits:

$$\widehat{R}_1(\widehat{s}) = \frac{1}{n} \sum_{i=1}^{n} \gamma\left(\widehat{s}(X^{(i)}_{1,n}), X_i\right) ,$$

where $X^{(i)}_{1,n}$ represents $X_{1,n}$ from which $X_i$ has been removed.

In order to stick to the resampling formalism, Loo corresponds to the choice of a random
vector $W_n$ such that $W_{n,j} \in \{0, n\}$, $\mathbb{P}(W_{n,j} > 0) = 1/n$ for any j, and $\sum_{j=1}^{n} W_{n,j} = n$.

Leave-p-out Leave-p-out (Lpo) generalizes Loo to the case where $1 \le p \le n-1$
observations are removed from the original data at each split.

It is studied in the linear regression setup by Shao (1993) and Zhang (1993), and in
the change-point detection setting by Arlot and Celisse (2009). In density estimation,
Celisse and Robin (2008b) derive closed-form expressions for the Lpo estimator with
histograms and kernels.

The Lpo estimator of $r_n(\widehat{s})$ follows the same procedure as Loo, except that at each
one of the $\binom{n}{p}$ possible splits, p observations are removed:

$$\widehat{R}_p(\widehat{s}) = \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_p} \left[ \frac{1}{p} \sum_{i \in e} \gamma\left(\widehat{s}(X^{\bar e}_{1,n}), X_i\right) \right] ,$$

where $X^{\bar e}_{1,n}$ denotes the observations whose indices do not belong to $e$.

The corresponding weights satisfy $W_{n,i} \in \{0, n/p\}$ for any i, $\sum_{i=1}^{n} W_{n,i} = n$, and the
probability of any such vector is $\binom{n}{p}^{-1}$.

Remark 1. A naive implementation of Lpo has computational complexity of order $\binom{n}{p}$
times that of the computation of $\widehat{s}$, which is intractable as soon as n is large and p > 1.
Even Loo, that is Lpo with p = 1, can be time-consuming.
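To make the cost described in Remark 1 concrete, here is a minimal sketch of the naive Lpo
estimator for histograms, enumerating every test set of cardinality p. The helper names
(`histogram_heights`, `contrast`, `naive_lpo`) are hypothetical and chosen for this
illustration; the loop visits all C(n, p) splits, so it is only usable for very small n.

```python
from itertools import combinations

import numpy as np


def histogram_heights(train, edges):
    """Heights n_I / (n_train * |I|) of the histogram built on the training points."""
    counts, _ = np.histogram(train, bins=edges)
    return counts / (len(train) * np.diff(edges))


def contrast(heights, edges, points):
    """gamma(s_hat, x) = ||s_hat||_2^2 - 2 * s_hat(x) for a piecewise-constant s_hat."""
    squared_norm = np.sum(heights ** 2 * np.diff(edges))
    # Same cell convention as np.histogram: [a, b) cells, last cell closed on the right.
    cells = np.clip(np.searchsorted(edges, points, side="right") - 1, 0, len(heights) - 1)
    return squared_norm - 2.0 * heights[cells]


def naive_lpo(x, edges, p):
    """Average the hold-out criterion over all C(n, p) test sets (exponential cost)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    total, n_splits = 0.0, 0
    for test_idx in combinations(range(n), p):
        in_test = np.zeros(n, dtype=bool)
        in_test[list(test_idx)] = True
        heights = histogram_heights(x[~in_test], edges)
        total += np.mean(contrast(heights, edges, x[in_test]))
        n_splits += 1
    return total / n_splits
```

With n = 30 and p = 15 the loop above would already require more than 10^8 histogram
fits, which illustrates why exhaustive enumeration is not a practical option.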



V-fold cross-validation Due to the high computational burden of the previous
procedures, Geisser (1974, 1975) introduced an alternative procedure named V-fold
cross-validation (VFCV). For instance, it has been studied in Burman (1989, 1990), who suggests
a correction to remove some bias.

VFCV relies on a preliminary (random or not) choice of a partition of the data into
V subsets of approximately equal size n/V. Each subset is successively left out, and the
V − 1 remaining ones are used to compute the estimator, while the left-out one is dedicated
to performance assessment. The V-fold risk estimator is the average over the V resulting
estimators.

For a given random partition of the data, the above description results in V weight
vectors $W_n$ of respective probability 1/V, satisfying $W_{n,i} \in \{0, V\}$ for any i, and
$\sum_{i=1}^{n} W_{n,i} = n$.

Let $e_1, \ldots, e_V$ denote the partition of $\{1, \ldots, n\}$ into V blocks. Then, the VFCV
estimator of $r_n(\widehat{s})$ is

$$\widehat{R}_{VFCV,V}(\widehat{s}) = \frac{1}{V} \sum_{v=1}^{V} \left[ \frac{V}{n} \sum_{i \in e_v} \gamma\left(\widehat{s}(X^{\bar e_v}_{1,n}), X_i\right) \right] .$$
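The V-fold procedure just described can be sketched as follows for histogram estimators.
This is an illustration under assumptions: the function name `vfold_risk`, the `seed`
argument and the use of a random permutation to build the blocks are choices made here, and
each block is averaged over its own size (which equals n/V when V divides n).

```python
import numpy as np


def vfold_risk(x, edges, V, seed=0):
    """V-fold estimate of the risk of the histogram estimator on the partition `edges`."""
    x = np.asarray(x, dtype=float)
    edges = np.asarray(edges, dtype=float)
    lengths = np.diff(edges)
    rng = np.random.default_rng(seed)
    # Random partition of the indices into V blocks of (almost) equal size.
    blocks = np.array_split(rng.permutation(x.size), V)
    block_risks = []
    for block in blocks:
        train = np.delete(x, block)                      # the V - 1 remaining blocks
        counts, _ = np.histogram(train, bins=edges)
        heights = counts / (train.size * lengths)
        squared_norm = np.sum(heights ** 2 * lengths)
        cells = np.clip(
            np.searchsorted(edges, x[block], side="right") - 1, 0, heights.size - 1
        )
        # Contrast gamma(s_hat, X_i) averaged over the left-out block.
        block_risks.append(np.mean(squared_norm - 2.0 * heights[cells]))
    return float(np.mean(block_risks))
```

Taking V = n makes every block a single observation, so the function then returns the Loo
estimate whatever the permutation; for V < n the value depends on the random partition.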
2.2.4 Lpo versus VFCV

Nowadays, it is usual to deal with a large amount of data. For instance, biology as well
as computer vision are perfect illustrations of this statement.

As explained in Section 2.2.3, the Loo computational complexity is n times that of $\widehat{s}$,
which can be highly time-consuming. In this respect, provided $V \ll n$, VFCV (Geisser,
1974, 1975) is far less computationally demanding than Loo.

However, VFCV relies on a random partitioning of the data into V subsets. This
additional randomness induces some more variability with respect to Loo and Lpo, which both
carry out exhaustive splitting of the observations. In the density estimation framework,
Celisse and Robin (2008b) have theoretically quantified the amount of randomness induced
by applying VFCV instead of Lpo.

As it does not introduce any additional variability, Lpo can be seen as a "gold
standard" among CV procedures. VFCV turns out to be an approximation of the "ideal Lpo",
which is unachievable due to prohibitive computation time. Indeed, the Lpo computation
requires exploring $\binom{n}{p}$ resamples, which is intractable even for moderately large n when
$p \ge 2$. Therefore, in full generality, Lpo cannot be performed and one has to use
approximations.

Other approximations to Lpo exist, like repeated learning-testing cross-validation
(RLT), introduced by Breiman et al. (1984) and then studied in Burman (1989) and Zhang
(1993).

The purpose of the next section is to describe a broad range of settings in which 
closed-form expressions of the Lpo estimator can be derived. On the one hand, such 
formulas drastically reduce the Lpo computational complexity from exponential — for a 
naive implementation of Lpo — to linear. Furthermore, such formulas make Lpo preferable 
to VFCV since the latter is more variable and expensive to perform. 

On the other hand, these closed-form formulas yield more insight in the general be- 
haviour of CV as an estimator of the risk. The study of CV as a model selection procedure 
is the main concern of Section 4. 

3 Closed-form expressions 

Closed-form expressions of the Lpo risk estimator are provided for the broad family of 
projection estimators. First, such formulas enable the efficient computation of Lpo. Sec- 
ond, they provide some information about the quality of the CV estimator as an estimator 
of the risk. Indeed, closed-form expressions for bias and variance of the CV estimator are 
also derived. 

3.1 Leave-p-out risk estimator 

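Here is an elementary and essential lemma that leads to these closed-form expressions.
This result is obtained thanks to combinatorial calculations.

Lemma 3.1. Let $\widehat{s}_m(X^{\bar e}_{1,n})$ denote a generic projection estimator based on model $S_m$ and
computed from the training set $X^{\bar e}_{1,n}$. Then,

$$\sum_{e \in \mathcal{E}_p} \left\| \widehat{s}_m(X^{\bar e}_{1,n}) \right\|_2^2 = \frac{1}{(n-p)^2} \left[ \binom{n-1}{p} \sum_{k=1}^{n} \sum_{\lambda \in \Lambda(m)} \varphi_\lambda^2(X_k) + \binom{n-2}{p} \sum_{k \neq \ell} \sum_{\lambda \in \Lambda(m)} \varphi_\lambda(X_k)\varphi_\lambda(X_\ell) \right] , \qquad (2)$$

$$\sum_{e \in \mathcal{E}_p} \sum_{i \in e} \widehat{s}_m(X^{\bar e}_{1,n})(X_i) = \frac{1}{n-p} \binom{n-2}{p-1} \sum_{i \neq j} \sum_{\lambda \in \Lambda(m)} \varphi_\lambda(X_i)\varphi_\lambda(X_j) . \qquad (3)$$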

The proof of Lemma 3.1 is deferred to Section 7. 

From the previous lemma, the closed-form expression for the Lpo estimator of the risk 
is derived. This expression holds with the quadratic loss and projection estimators. 

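Proposition 3.1. For any $m \in \mathcal{M}_n$, let $\widehat{s}_m$ denote the projection estimator onto the model
$S_m$, spanned by the orthonormal basis $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$. Then for any $p \in \{1, \ldots, n-1\}$,

$$\widehat{R}_p(m) = \frac{1}{n(n-p)} \sum_{\lambda \in \Lambda(m)} \left[ \sum_{j=1}^{n} \varphi_\lambda^2(X_j) - \frac{n-p+1}{n-1} \sum_{j \neq k} \varphi_\lambda(X_j)\varphi_\lambda(X_k) \right] . \qquad (4)$$

The computation cost of (4) is of order O(n).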
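The following sketch shows how a formula of the type (4) can be evaluated in one pass over
the data, for any projection estimator described by a matrix of basis evaluations. The names
`lpo_closed_form` and `histogram_features` are hypothetical, and the expression coded below
is the reading of (4) stated in the docstring; the cross term uses the identity
sum_{j != k} a_j a_k = (sum_j a_j)^2 - sum_j a_j^2, which keeps the cost linear in n for a
fixed basis.

```python
import numpy as np


def lpo_closed_form(phi, p):
    """Closed-form Lpo risk estimate, with phi[i, l] = phi_l(X_i).

    Evaluates (1 / (n (n - p))) * sum_l [ sum_j phi_l(X_j)^2
        - (n - p + 1) / (n - 1) * sum_{j != k} phi_l(X_j) phi_l(X_k) ].
    """
    phi = np.asarray(phi, dtype=float)
    n = phi.shape[0]
    if not 1 <= p <= n - 1:
        raise ValueError("p must satisfy 1 <= p <= n - 1")
    sum_sq = np.sum(phi ** 2, axis=0)              # sum_j phi_l(X_j)^2
    cross = np.sum(phi, axis=0) ** 2 - sum_sq      # sum_{j != k} phi_l(X_j) phi_l(X_k)
    return float(np.sum(sum_sq - (n - p + 1) / (n - 1) * cross) / (n * (n - p)))


def histogram_features(x, edges):
    """Basis evaluations phi_l(X_i) = 1_{I_l}(X_i) / sqrt(|I_l|) for a histogram model."""
    x = np.asarray(x, dtype=float)
    edges = np.asarray(edges, dtype=float)
    n_cells = edges.size - 1
    cells = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_cells - 1)
    phi = np.zeros((x.size, n_cells))
    phi[np.arange(x.size), cells] = 1.0 / np.sqrt(np.diff(edges)[cells])
    return phi
```

For a sample `x` and a partition `edges`, `lpo_closed_form(histogram_features(x, edges), p)`
gives the Lpo estimate without enumerating any split; other bases only require a different
feature matrix.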



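Proof. In the density estimation framework, the contrast associated with the $L^2$-loss is
$\gamma(t,X) = \|t\|_2^2 - 2t(X)$. Subsequently, the Lpo estimator is

$$\widehat{R}_p(m) = \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_p} \left\| \widehat{s}_m(X^{\bar e}_{1,n}) \right\|_2^2 - \frac{2}{p} \binom{n}{p}^{-1} \sum_{e \in \mathcal{E}_p} \sum_{i \in e} \widehat{s}_m(X^{\bar e}_{1,n})(X_i) .$$

Besides, the general projection estimator is

$$\widehat{s}_m(X^{\bar e}_{1,n}) = \sum_{\lambda \in \Lambda(m)} \left( \frac{1}{n-p} \sum_{j \in \bar e} \varphi_\lambda(X_j) \right) \varphi_\lambda .$$

The simple application of (2) and (3) provides the expected conclusion. □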
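The substitution behind this last step can be spelled out as follows. This is only a sketch
of the arithmetic, assuming that (2) carries the weights $\binom{n-1}{p}$ and $\binom{n-2}{p}$
and that (3) carries the weight $\binom{n-2}{p-1}$; the shorthands $A$ and $B$ are introduced
here for readability.

```latex
\[
\frac{\binom{n-1}{p}}{\binom{n}{p}} = \frac{n-p}{n}, \qquad
\frac{\binom{n-2}{p}}{\binom{n}{p}} = \frac{(n-p)(n-p-1)}{n(n-1)}, \qquad
\frac{\binom{n-2}{p-1}}{\binom{n}{p}} = \frac{p(n-p)}{n(n-1)} .
\]
With
$A = \sum_{\lambda \in \Lambda(m)} \sum_{k=1}^{n} \varphi_\lambda^2(X_k)$ and
$B = \sum_{\lambda \in \Lambda(m)} \sum_{k \neq \ell} \varphi_\lambda(X_k)\varphi_\lambda(X_\ell)$,
\begin{align*}
\widehat{R}_p(m)
 &= \frac{1}{(n-p)^2}\left[ \frac{n-p}{n}\, A + \frac{(n-p)(n-p-1)}{n(n-1)}\, B \right]
    - \frac{2}{p} \cdot \frac{1}{n-p} \cdot \frac{p(n-p)}{n(n-1)}\, B \\
 &= \frac{A}{n(n-p)} + \frac{(n-p-1) - 2(n-p)}{n(n-1)(n-p)}\, B \\
 &= \frac{1}{n(n-p)} \left[ A - \frac{n-p+1}{n-1}\, B \right] .
\end{align*}
```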



Examples We are now in a position to specify the expression of the Lpo risk estimator in
Proposition 3.1 for several projection estimators.

1. Histograms:

Corollary 3.1. Let us assume that $\widehat{s}_m$ denotes the histogram estimator built from
the partition $I(m) = (I_1, \ldots, I_{D_m})$ of [0, 1] into $D_m$ intervals of respective length $|I_\lambda|$.
Then for $p \in \{1, \ldots, n-1\}$,

$$\widehat{R}_p(m) = \frac{1}{(n-1)(n-p)} \sum_{\lambda=1}^{D_m} \frac{1}{|I_\lambda|} \left[ (2n-p) \frac{n_\lambda}{n} - n(n-p+1) \left( \frac{n_\lambda}{n} \right)^2 \right] , \qquad (5)$$

where $n_\lambda = \sharp\{i \mid X_i \in I_\lambda\}$.

Proof. (5) comes simply from the application of (4) with $\varphi_\lambda = \mathbb{1}_{I_\lambda} / \sqrt{|I_\lambda|}$. □

2. Trigonometric polynomials:



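Corollary 3.2. Let $\varphi_\lambda$ denote either $t \mapsto \cos(2\pi k t)$, if $\lambda = 2k \in 2\mathbb{N}$, or $t \mapsto \sin(2\pi k t)$, if
$\lambda = 2k+1 \in 2\mathbb{N} + 1$.

Let us further assume that $\Lambda(m) = \{0, \ldots, 2K\}$ for an integer $K > 0$. Then,

$$\widehat{R}_p(m) = \frac{(2n-p)(K+1)}{(n-1)(n-p)} - \frac{n-p+1}{n(n-1)(n-p)} \sum_{k=0}^{K} \left[ \left\{ \sum_{i=1}^{n} \cos(2\pi k X_i) \right\}^2 + \left\{ \sum_{i=1}^{n} \sin(2\pi k X_i) \right\}^2 \right] .$$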



3. Haar basis: 



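Corollary 3.3. Set $\varphi : t \mapsto \mathbb{1}_{[0,1]}(t)$ and $\varphi_{j,k}(t) = 2^{j/2} \varphi(2^j t - k)$, where $j \in \mathbb{N}$ and
$0 \le k \le 2^j - 1$.

For any $m \in \mathcal{M}_n$, let us define $\Lambda(m) \subset \{(j,k) \mid j \in \mathbb{N}, \ 0 \le k \le 2^j - 1\}$. Then,

$$\widehat{R}_p(m) = \frac{1}{(n-1)(n-p)} \sum_{(j,k) \in \Lambda(m)} 2^j \left[ (2n-p) \frac{n_{j,k}}{n} - n(n-p+1) \left( \frac{n_{j,k}}{n} \right)^2 \right] ,$$

where $n_{j,k} = \mathrm{Card}\left( \{ i \mid X_i \in [k/2^j, (k+1)/2^j] \} \right)$.

3.2 Moment calculations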

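As a consequence of the closed-form expressions established in the previous section, similar
expressions are also available for the expectation and variance. A precise assessment of the
performance of the Lpo estimator as an estimator of the risk is thus available thanks to
these closed-form expressions.

Proposition 3.2. With the same notation as in Proposition 3.1, we have for any
$1 \le p \le n-1$,

$$\mathbb{E}\,\widehat{R}_p(m) = \frac{1}{n-p} \sum_{\lambda \in \Lambda(m)} \left[ P\varphi_\lambda^2 - (P\varphi_\lambda)^2 \right] - \sum_{\lambda \in \Lambda(m)} (P\varphi_\lambda)^2 ,$$

$$\mathrm{Var}\left[\widehat{R}_p(m)\right] = \frac{1}{\left( n(n-1)(n-p) \right)^2} \Bigg\{ n\alpha^2 \left[ \mathbb{E}\Big( \sum_{\lambda} \varphi_\lambda^2(X) \Big)^2 - \Big( \sum_{\lambda} P\varphi_\lambda^2 \Big)^2 \right] + 2\beta^2 t_1 \sum_{\lambda, \lambda'} \left( P\varphi_\lambda\varphi_{\lambda'} \right)^2 + 4\beta^2 t_2 \, \mathbb{E}\left[ \Big( \sum_{\lambda} P\varphi_\lambda \, \varphi_\lambda(X) \Big)^2 \right]$$
$$\qquad + (-4n+6)\, t_1 \beta^2 \Big( \sum_{\lambda} (P\varphi_\lambda)^2 \Big)^2 - 4\alpha\beta t_1 \sum_{\lambda, \lambda'} P(\varphi_\lambda^2 \varphi_{\lambda'}) \, P\varphi_{\lambda'} + 4\alpha\beta t_1 \sum_{\lambda, \lambda'} P\varphi_\lambda^2 \, (P\varphi_{\lambda'})^2 \Bigg\} ,$$

where all sums run over $\Lambda(m)$, $P\varphi_\lambda = \mathbb{E}\varphi_\lambda(X)$, $\alpha = n-1$, $\beta = n-p+1$, $t_1 = n(n-1)$, and $t_2 = t_1(n-2)$.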



The technical proof is given in Section 7. Note that these formulas may be derived provided
$P|\varphi_\lambda|^4 < +\infty$ for any $\lambda \in \Lambda(m)$, which is satisfied if $s$ is assumed to be bounded and
$\int |\varphi_\lambda|^4 < +\infty$ ($\varphi_\lambda$ continuous and compactly supported for instance).

The bias of the Lpo risk estimator may be a more interesting quantity to work with.
Its expression results straightforwardly from Proposition 3.2.

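Corollary 3.4. For any projection estimator, the bias of the Lpo estimator is equal to

$$\mathbb{B}\left[\widehat{R}_p(m)\right] := \mathbb{E}\,\widehat{R}_p(m) - r_n(m) = \frac{p}{n(n-p)} \sum_{\lambda \in \Lambda(m)} \left[ P\varphi_\lambda^2 - (P\varphi_\lambda)^2 \right] = \frac{p}{n(n-p)} \sum_{\lambda \in \Lambda(m)} \mathrm{Var}\left[\varphi_\lambda(X)\right] \ge 0 ,$$

where $r_n(m) = \mathbb{E}\left[ \|\widehat{s}_m\|_2^2 - 2 \int_{[0,1]} s \, \widehat{s}_m \right]$.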



Illustration By application of Proposition 3.2 to histogram estimators, the follow- 
ing expressions are derived for expectation and variance of the Lpo risk estimator (see 
Celisse and Robin, 2008b): 

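Corollary 3.5. For every $\lambda \in \Lambda(m)$, set $\alpha_\lambda = \mathbb{P}(X_1 \in I_\lambda)$ and $\omega_\lambda = |I_\lambda|$. Then,

$$\mathbb{E}\left[\widehat{R}_p(m)\right] = \frac{1}{n-p} \sum_{\lambda \in m} \frac{\alpha_\lambda (1 - \alpha_\lambda)}{\omega_\lambda} - \sum_{\lambda \in m} \frac{\alpha_\lambda^2}{\omega_\lambda} ,$$

$$\mathrm{Var}\left[\widehat{R}_p(m)\right] = \frac{p^2 q_2(n,\alpha,\omega) + p \, q_1(n,\alpha,\omega) + q_0(n,\alpha,\omega)}{\left[ n(n-1)(n-p) \right]^2} ,$$

where

$$\forall (i,j) \in \{1, \ldots, 3\} \times \{1, 2\}, \quad s_{i,j} = \sum_{k=1}^{D} \alpha_k^i / \omega_k^j ,$$

$$q_2(n,\alpha,\omega) = n(n-1) \left[ 2 s_{2,2} + 4 s_{3,2}(n-2) + s_{2,1}^2(-4n+6) \right] ,$$

$$q_1(n,\alpha,\omega) = n(n-1) \left[ -8 s_{2,2} - 8 s_{3,2}(n-2)(n+1) - 4 s_{1,1} s_{2,1}(n-1) - 2 s_{2,1}^2(-4n^2 + 2n + 6) \right] ,$$

$$q_0(n,\alpha,\omega) = n(n-1) \left[ s_{1,2}(n-1) - 2 s_{2,2}(n^2 - 2n - 3) + 4 s_{3,2}(n-2)(n+1)^2 - s_{1,1}^2(n-1) + 4 s_{1,1} s_{2,1}(n^2 - 1) + s_{2,1}^2(-4n+6)(n+1)^2 \right] .$$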

4 Model selection 

Although CV is extensively used in practice, very little is known about its non-asymptotic
behaviour as a model selection procedure. In particular, there is no theoretical,
non-asymptotic guideline about the optimal choice of p with respect to the model selection
goal one pursues.

The purpose of the present section is first to analyze CV as a penalized criterion, which 
enables a new interpretation of the choice of p. Second, the performance of CV in terms 
of model selection procedure is quantified by non-asymptotic optimality results. To the 
best of our knowledge, these oracle inequalities are the first results of this type. 

4.1 Random penalty 

This section sheds new light on the behaviour of CV as a model selection procedure with
respect to the choice of p. On the one hand, CV is embedded in the framework of model
selection via penalized criteria. It is shown that the choice of p determines the amount of
penalization.

On the other hand, several conclusions are drawn about the appropriate non-asymptotic
use of CV, depending on the value of p. For instance, since it behaves like
Mallows' $C_p$, Loo must not be employed as a model selection procedure with exponential
collections of models.

4.1.1 Ideal and Lpo penalties 

Given the a priori knowledge of the target $s$, a countable collection $\{S_m\}_{m \in \mathcal{M}_n}$ is chosen
so that the $S_m$'s are assumed to be close to $s$. The purpose of model selection is to design
a procedure providing a candidate model $\widehat{m}$ such that the final estimator $\widehat{s}_{\widehat{m}}$ is as close
as possible to the target $s$.

For instance, the choice of $\widehat{m}$ is made by minimizing a penalized criterion crit(·)
(Barron et al., 1999) defined by

$$\forall m \in \mathcal{M}_n, \quad \mathrm{crit}(m) = P_n\gamma(\widehat{s}_m) + \mathrm{pen}(m) , \qquad (6)$$

where $P_n\gamma(\widehat{s}_m)$ is the empirical risk of $\widehat{s}_m$, and $\mathrm{pen}(\cdot) : \mathcal{M}_n \to \mathbb{R}_+$ denotes the penalty term,
which takes into account the complexity of model $S_m$.

On the one hand, the optimal criterion to minimize over $\mathcal{M}_n$ is the ideal random
quantity

$$\mathrm{crit}_{id}(m) = P\gamma(\widehat{s}_m) := \mathbb{E}\,\gamma(\widehat{s}_m, X) , \qquad (7)$$

where the expectation is taken with respect to $X \sim P$, which is independent from the
original data. The minimization of the ideal criterion $\mathrm{crit}_{id}$ over $\mathcal{M}_n$ would systematically
yield the best estimator one can achieve among $\{\widehat{s}_m\}_{m \in \mathcal{M}_n}$, that is, the oracle. The link
between (6) and (7) can be clarified by rewriting

$$\mathrm{crit}_{id}(m) = P_n\gamma(\widehat{s}_m) + \left[ P\gamma(\widehat{s}_m) - P_n\gamma(\widehat{s}_m) \right] ,$$

so that the ideal penalty is defined by

$$\forall m \in \mathcal{M}_n, \quad \mathrm{pen}_{id}(m) := P\gamma(\widehat{s}_m) - P_n\gamma(\widehat{s}_m) .$$

The ideal penalty is what must be added to the empirical risk to recover the ideal criterion.

On the other hand, following the CV strategy, we perform model selection by minimizing
the Lpo risk estimator over $\mathcal{M}_n$. Thus for a given $1 \le p \le n-1$, the candidate $\widehat{m}$ satisfies

$$\widehat{m} = \mathrm{Argmin}_{m \in \mathcal{M}_n} \widehat{R}_p(m) .$$

The existence of a strong relationship between penalized criteria and CV procedures is
supported by the large amount of literature about (asymptotic) comparisons
of these two model selection procedures (see for instance Stone, 1977; Li, 1987; Zhang,
1993). Therefore, the CV strategy can be embedded into penalized criterion minimization
procedures:

$$\widehat{m} = \mathrm{Argmin}_{m \in \mathcal{M}_n} \left\{ P_n\gamma(\widehat{s}_m) + \mathrm{pen}_p(m) \right\} ,$$

where $\mathrm{pen}_p(m)$ is called the Lpo penalty of model m and satisfies, for every m,
$\mathrm{pen}_p(m) := \widehat{R}_p(m) - P_n\gamma(\widehat{s}_m)$. A somewhat related approach applied to Loo can be found in
Birgé and Massart (1997).
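The embedding above can be illustrated on regular histograms. The sketch below is an
assumption-laden example rather than the paper's procedure: the function names, the
collection of regular partitions with 1 to 30 cells, the simulated Beta(2, 5) sample and the
choice p = 50 are all hypothetical. The empirical risk of a histogram is
$-\sum_\lambda (n_\lambda/n)^2 / |I_\lambda|$, and subtracting it from the histogram closed
form gives the penalty coded in `lpo_penalty` (a short computation made for this example).

```python
import numpy as np


def empirical_risk(x, edges):
    """P_n gamma(s_hat_m) = -sum_l (n_l / n)^2 / |I_l| for the histogram on `edges`."""
    counts, _ = np.histogram(x, bins=edges)
    freq = counts / x.size
    return float(-np.sum(freq ** 2 / np.diff(edges)))


def lpo_penalty(x, edges, p):
    """pen_p(m) = R_p(m) - P_n gamma(s_hat_m), written in closed form for histograms."""
    n = x.size
    counts, _ = np.histogram(x, bins=edges)
    freq = counts / n
    scale = (2 * n - p) / ((n - 1) * (n - p))
    return float(scale * np.sum(freq * (1.0 - freq) / np.diff(edges)))


def select_dimension(x, dims, p):
    """Return the number of cells minimizing empirical risk + Lpo penalty, and all criteria."""
    x = np.asarray(x, dtype=float)
    crit = {}
    for dim in dims:
        edges = np.linspace(0.0, 1.0, dim + 1)   # regular partition of [0, 1]
        crit[dim] = empirical_risk(x, edges) + lpo_penalty(x, edges, p)
    return min(crit, key=crit.get), crit


rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=500)            # hypothetical data on [0, 1]
m_hat, crit = select_dimension(sample, dims=range(1, 31), p=50)
```

Since the penalty is proportional to $(2n-p)/((n-1)(n-p))$, increasing p in this sketch
increases the weight of the penalty relative to the empirical risk.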

4.1.2 Lpo overpenalization 

This embedding of CV into penalized criteria provides more insight into the behaviour of CV 
procedures with respect to the parameter $p$. In particular, some features of the behaviour of 
$\mathrm{pen}_p$ as a function of $p$ arise from the comparison between $\mathrm{pen}_p$ and $\mathrm{pen}_{id}$. This comparison 
is carried out through the expectations of these penalties. The next results hold for general 
projection estimators. 

Let us start with a preliminary lemma: 

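Lemma 4.1. With any projection estimator $\widehat{s}_m$ onto $S_m$, we obtain 

$$\mathbb{E}\left[\mathrm{pen}_{id}(m)\right] = \frac{2}{n} \sum_{\lambda \in \Lambda(m)} \mathrm{Var}\left(\varphi_\lambda(X)\right),$$

$$\mathbb{E}\left[\mathrm{pen}_p(m)\right] = \frac{2n-p}{n(n-p)} \sum_{\lambda \in \Lambda(m)} \mathrm{Var}\left(\varphi_\lambda(X)\right).$$

This makes it possible to evaluate precisely the discrepancy between the Lpo and ideal penalties: 

Proposition 4.1. For every $m \in \mathcal{M}_n$, let $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$ denote an orthonormal basis of $S_m$ 
and $\widehat{s}_m$ the projection estimator onto $S_m$. Then, for every $m \in \mathcal{M}_n$ and $1 \le p \le n-1$, 

$$\mathbb{E}\left[\mathrm{pen}_p(m) - \mathrm{pen}_{id}(m)\right] = \frac{p}{n(n-p)} \sum_{\lambda \in \Lambda(m)} \mathrm{Var}\left(\varphi_\lambda(X)\right) \ge 0. \tag{8}$$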

For every $1 \le p \le n-1$, the Lpo penalty remains larger in expectation than the ideal one, by an 
amount that increases with $p$. Furthermore, this amount of penalization can vary within 
a wide range of values. Indeed, (8) yields 

$$\mathbb{E}\left[\mathrm{pen}_p(m)\right] = C_{over}(p)\, \mathbb{E}\left[\mathrm{pen}_{id}(m)\right],$$

where $C_{over}(p) = (2n-p)/(2n-2p)$. Therefore with $p = 1$, $C_{over}(1) = 1 + 1/(2n-2)$ 
leads to a nearly unbiased estimator of the ideal penalty, while $C_{over}(n/2) = 3/2$ indicates 
that the Lpo penalty overpenalizes by an amount of the same order as the ideal penalty. 
A $\log n$ factor can even be achieved by $C_{over}(p)$ provided $p \approx \left(1 - 1/(2\log n - 1)\right) n$. 
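
The three regimes mentioned above can be checked directly from the expression of the overpenalization factor. This is a small sketch; the function names `c_over` and `p_for_log_factor` are illustrative, and $p$ is treated as a real number when solving for the $\log n$ factor, whereas in practice it would be rounded to an integer.

```python
import math


def c_over(n, p):
    """Overpenalization factor C_over(p) = (2n - p) / (2n - 2p), for 1 <= p <= n - 1."""
    if not 1 <= p <= n - 1:
        raise ValueError("p must satisfy 1 <= p <= n - 1")
    return (2 * n - p) / (2 * n - 2 * p)


def p_for_log_factor(n):
    """Real-valued p solving C_over(p) = log(n); not rounded to an integer here."""
    return (1 - 1 / (2 * math.log(n) - 1)) * n


n = 1000
print(c_over(n, 1))                    # close to 1: Loo is nearly unbiased
print(c_over(n, n // 2))               # 1.5: overpenalization of the order of the ideal penalty
print(p_for_log_factor(n) / n)         # fraction of the sample left out to reach a log(n) factor
print(c_over(n, p_for_log_factor(n)), math.log(n))
```

The last line prints two values that coincide up to rounding, which reflects that a BIC-like amount of penalization requires leaving out a proportion of the data close to 1.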

At this stage, an important distinction must be made between risk estimation and 
model selection. On the one hand, if the purpose is the estimation of the risk of a given 
estimator, for instance, an unbiased (or nearly unbiased) estimator of this risk can be 
desirable. Therefore, the choice p = 1 seems the most appropriate one, provided the 
variance of the resulting estimator remains at a reasonable level. As for "optimal" risk 
estimation, Celisse and Robin (2008b) have developed a strategy that aims at providing the 
Lpo risk estimator with the smallest mean squared error. In Celisse and Robin (2008a), it is 
shown that the proposed estimator asymptotically amounts to Loo as $n$ tends to infinity. 
This is also consistent with the results of Burman (1989), who shows in the regression 
setting that Loo is asymptotically the best risk estimator among CV ones in terms of bias 
and variance. 

On the other hand, model selection requires choosing the model closest to the target, 
even at the price of a worse estimation of the risk for some models in the collection. 
For instance, minimizing a penalized criterion over $\mathcal{M}_n$ leads to misleading models with a 
probability that increases with the size of $\mathcal{M}_n$, provided the oracle remains the same. This is due to 
random downward deviations of the minimized criterion for some "bad models". A classical 
way to balance these unwanted deviations is to overpenalize by an amount that depends 
on the structure of the considered collection of models. Consequently, choosing $p = 1$ 
is not necessarily desirable in model selection, as already noticed in Breiman and Spector 
(1992) and more recently in Celisse (2008, Chap. 6), where Loo has been empirically shown to 
suffer from overfitting with polynomial collections of models. 

Several conclusions can be drawn from Proposition 4.1 about the CV performance as 
a model selection procedure. First, Loo is a nearly unbiased risk estimator, which results 
in a behaviour similar to that of Mallows' $C_p$. This is consistent with the asymptotic results 
established by Li (1987) and Zhang (1993). As a consequence, Loo only aims at yielding 
a reliable estimation of the target in order to perform (asymptotically) efficient model 
selection (see Section 1 and Li (1987)). In particular, Loo cannot be employed, with an 
identification purpose, to recover the "true model" with probability converging to 1 as $n$ 
tends to infinity, which is the goal of BIC. 

Second, with Loo (and Lpo for small values of $p$), only model selection over polynomial 
collections of models can be carried out. For instance, using Loo with exponential 
collections of models would systematically lead to overly large models. 

Third, identification can nevertheless be pursued by CV, provided $p$ is chosen of 
the appropriate order. Indeed, $p \approx \left(1 - 1/(2\log n - 1)\right) n$ yields a $\log n$ term like the one 
in the BIC penalty. This also agrees with the asymptotic result established by Shao (1993) 
in the regression setting. 

4.2 Oracle inequalities 

In the following, the quality of the Lpo-based model selection procedure is assessed through 
oracle inequalities. These results are established in the polynomial complexity 
framework and hold for any projection estimator. To our knowledge, these are the first 
non-asymptotic results about the performance of Lpo in this framework. 

Unlike the usual approach in model selection via penalized criteria, the purpose here 
is not to design a penalty function. Indeed, the Lpo estimator itself can be understood as 
a penalized criterion (see Section 4.1). 

4.2.1 Preliminaries 

The main results rely on several assumptions detailed and discussed in the following. 
Set $X \sim s$ and for every index $m$, 

$$\phi_m := \sum_{\lambda \in \Lambda(m)} \varphi_\lambda^2.$$

Then, let us define the following assumptions: 

(Reg) $\exists \Phi > 0$ / $\sup_{m \in \mathcal{M}_n} \|\phi_m\|_\infty \le \Phi\, n/(\log n)^2$. 

Since $\phi_m = \sum_{\lambda \in \Lambda(m)} \varphi_\lambda^2$, $\|\phi_m\|_\infty$ can be understood as a regularity measure of the 
basis $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$. Thus, (Reg) relates the regularity of the considered basis to the amount 
of data. This assumption has already been used by Castellan (2003) for instance. Let us 
assume we use histogram estimators based on a partition $\{I_1, \ldots, I_{D_m}\}$ of $[0,1]$ in $D_m$ 
intervals, and that $\varphi_\lambda = \mathbb{1}_{I_\lambda}/\sqrt{|I_\lambda|}$, where $|I_\lambda|$ is the length of $I_\lambda$. Then, (Reg) gives 
a lower bound on the minimal length of any interval $I_\lambda$ of the partition with respect to 
the number of observations, namely $|I_\lambda| \ge (\log n)^2/(\Phi n)$. In other words, partitions containing 
an interval shorter than $(\log n)^2/(\Phi n)$ are prohibited. 



(Reg2) $\exists \Phi > 0$ / $\forall m \in \mathcal{M}_n$, $\sup_{|a|_\infty = 1} \left\| \sum_{\lambda \in \Lambda(m)} a_\lambda \varphi_\lambda \right\|_\infty \le \sqrt{\Phi}\, \sqrt{n}/\log n$. 

(Reg2) is another regularity assumption about $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$. In the specific case of a 
basis defined from a partition of $[0, 1]$ (like histograms or piecewise polynomials), (Reg) 
implies (Reg2). Besides, the constant $\Phi$ is assumed to be the same in (Reg) and (Reg2), 
which holds up to replacing both constants by their maximum. A similar requirement to 
(Reg2) can be found in Massart (2007). 



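(Ad) $\exists \xi > 0$ / $\forall m \in \mathcal{M}_n$ with $D_m \ge 2$, $n\, \mathbb{E}\left[\|s_m - \widehat{s}_m\|^2\right] \ge \xi D_m$. 

Let us first notice that 

$$\mathbb{E}\left[\|s_m - \widehat{s}_m\|^2\right] = \sum_{\lambda \in \Lambda(m)} \mathrm{Var}\left(P_n\varphi_\lambda\right) = \frac{1}{n} \sum_{\lambda \in \Lambda(m)} \mathrm{Var}\left[\varphi_\lambda(X)\right].$$

With histograms, $\mathrm{Var}\left[\varphi_\lambda(X)\right]$ vanishes if and only if $\mathbb{P}(X \in I_\lambda) \in \{0, 1\}$, in particular when the support of $s$ is included in $I_\lambda$. 
(Ad) therefore requires that for any $m$, there are always "enough" informative basis vectors, 
where an informative vector is a vector such that $\mathrm{Var}\left[\varphi_\lambda(X)\right] \neq 0$. For instance, a sufficient 
condition for (Ad) to hold with histograms is $s \ge \rho > 0$ on $[0, 1]$. This assumption can 
also be found in Massart (2007). 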
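
The sufficient condition for histograms can be made explicit by a short computation. The following sketch assumes that (Ad) is read as a lower bound of order $D_m$ on $n\,\mathbb{E}\|s_m - \widehat{s}_m\|^2$ for $D_m \ge 2$, that $\varphi_\lambda = \mathbb{1}_{I_\lambda}/\sqrt{|I_\lambda|}$, and it writes $P(I_\lambda) = \int_{I_\lambda} s$; the constant $\rho/2$ obtained at the end is a by-product of this particular argument, not a value taken from the paper.

```latex
% Assumption: s >= rho > 0 on [0,1], so that P(I_lambda) >= rho |I_lambda|,
% and sum_lambda P(I_lambda) = 1 since (I_lambda) is a partition of [0,1].
\begin{align*}
n\,\mathbb{E}\|s_m - \widehat{s}_m\|^2
  &= \sum_{\lambda \in \Lambda(m)} \mathrm{Var}\big[\varphi_\lambda(X)\big]
   = \sum_{\lambda \in \Lambda(m)} \frac{P(I_\lambda)\,\big(1 - P(I_\lambda)\big)}{|I_\lambda|} \\
  &\ge \rho \sum_{\lambda \in \Lambda(m)} \big(1 - P(I_\lambda)\big)
   = \rho\,(D_m - 1)
   \ge \frac{\rho}{2}\, D_m \qquad (D_m \ge 2).
\end{align*}
```

The restriction to $D_m \ge 2$ appears naturally here: with a single interval, $P(I_\lambda) = 1$ and the variance term vanishes.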



(Pol) $\exists \delta > 0$ / $\forall D \ge 1$, $\left|\{m \in \mathcal{M}_n \mid D_m = D\}\right| \le D^\delta$. 

A model collection is said to have a polynomial complexity if (Pol) holds, that is 
if the cardinality of the set of models with dimension $D$ is polynomial in $D$. Such an 
assumption is satisfied with nested models for instance (Birgé and Massart, 1997). Provided 
no model has dimension larger than $n$, it 
straightforwardly implies that $\mathrm{Card}(\mathcal{M}_n) \le n^{1+\delta}$. 
4.2.2 Main results 



In the following, two oracle inequalities are established, which guarantee the ability of the 
Lpo procedure to select an efficient density estimator. The first result holds for bounded 
densities, while the second one concerns the more general case of square-integrable densities 
at the price of an additional assumption. Several instances of bases for which the latter 
assumption holds are provided at the end of this section. 



Bounded density 



Theorem 4.1. Let $s$ denote a bounded density on $[0,1]$ and $X_1, \ldots, X_n$ be $n$ i.i.d. random 
variables drawn from $s$. Let $\{\varphi_\lambda\}_{\lambda \in \Lambda_n}$ be a finite family of bounded functions on $[0, 1]$ such 
that for any $m \in \mathcal{M}_n$, $S_m$ denotes the vector space of dimension $D_m$ spanned by the 
orthonormal family $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$. Let us assume that (Reg), (Reg2), (Ad) and (Pol) 
hold. 

For n > 29, set < e < 1 such that 



1 + 3C(e) ^ n ^ 



< 1, 



where C(e) = 
{Ran) 



K{e) 



C(e)(n-l)-2 
. Then for any 1 < p <n — 1 satisfying 

2 



(9) 



2 1 + C(e) P 
1 + 3C(e) + nl + 3C(e) + " " n " 



/3 



with < a, /3 < 1, we have 



E 



<r(e,a,/3) inf E 

m&Mn 



C(e)(n-l)-2 

^ K{e,s,^,a,l3,5) 
n 



where $\Gamma(\epsilon, \alpha, \beta) \ge 1$ is a constant (with respect to $n$) independent from $s$ and 
$K(\epsilon, s, \Phi, \alpha, \beta, \delta) > 0$ is another constant. 

The proof of this result is deferred to Section 7. 
Remarks: 

• (Ran) is a sufficient condition for the oracle inequality to hold. In this assumption, 
$\alpha$ and $\beta$ can be chosen as small as we want, but cannot vanish. 

• The existence of $\epsilon$ satisfying the inequality (9) stems from a technical lemma given 
in the proof of Theorem 4.1. 

• As made clear by the proof of the aforementioned technical lemma, the choice 
of $\epsilon$ is constrained. For instance, $\epsilon$ cannot be too close to 0. This explains 
why the nonintuitive bounds in (Ran) cannot be easily simplified. Furthermore, 
this shows that "small values" of $p$ may be excluded from the range of values 
described in (Ran), to which the oracle inequality applies. 

• The independence of $\Gamma(\epsilon, \alpha, \beta)$ from $s$ is essential in our framework, since we have in 
mind the use of this result to derive adaptivity properties in the minimax sense. 



Square-integrable density 

The second result is derived following the same idea as the previous one, thanks to 
an additional mild assumption on the considered bases. This requirement turns out to be 
hardly restrictive, since it is met by a broad class of orthonormal bases. 

Theorem 4.2. Let $s$ denote a density in $L^2([0, 1])$ and $X_1, \ldots, X_n$ be $n$ i.i.d. random 
variables drawn from $s$. Let $\{\varphi_\lambda\}_{\lambda \in \Lambda_n}$ be a finite family of bounded functions on $[0, 1]$ 
such that for any $m \in \mathcal{M}_n$, $S_m$ denotes the vector space of dimension $D_m$ spanned by 
the orthonormal family $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$. Let us assume that (Reg), (Reg2), (Ad) and (Pol) 
hold, and moreover that 



(Reg3) $\exists \Phi > 0$ / $\forall m \in \mathcal{M}_n$, $\|\phi_m\|_\infty \le \Phi D_m$. 

For n > 29, set < e < 1 such that 

4C(e) 2 . 2 



2 

1 + 3C(e) + n ^ ^ 



< 1, 



where C(e) = 
{Ran) 



1 



(1 + ^)-^ 
K{e] 



C(e)(n-l)-2 
. Then for any 1 <p <n — 1 satisfying 

2 



2 1 + C(g) P_ 
1 + 3C(e) ^ nl + 3C(e) + " " n " 



with < a, /3 < 1, we have 
E 



<r(e,a,/3) inf E 

meMn 



C(e)(n-l)-2 
K(e, s, a, /3, 5) 



+ 



n 



where $\Gamma(\epsilon, \alpha, \beta) \ge 1$ is a constant (with respect to $n$) independent from $s$ and 
$K(\epsilon, s, \Phi, \alpha, \beta, \delta) > 0$ is another constant. 

For the sake of clarity, the proof is also deferred to Section 7. Since it is very similar to 
that of Theorem 4.1, only the main differences are detailed. 

Remark 2. Assumption (Reg3) is quite different from (Reg). Whereas the latter relates 
the "regularity" of any basis to the number of observations uniformly over $\mathcal{M}_n$, (Reg3) 
rather controls $\|\phi_m\|_\infty$ for every model by means of its dimension. All models with the 
same dimension must be somehow alike, since their associated sup-norm remains 
upper bounded by $\Phi D_m$. This assumption can be found in Birgé and Massart (1997) as 
well. 
Examples. Several examples of widespread functional bases are now detailed to illustrate 
the generality of assumption (Reg3). 

• It is easy to check that (Reg3) applies to regular histograms with $\Phi = 1$ (Section 
5.3). 

• A typical example of basis satisfying (Reg3) is the trigonometric basis. For $m \in \mathbb{N}$, let 
$\Lambda(m) = \{0, \ldots, 2m\}$ denote a set of indices where $\varphi_0 = \mathbb{1}_{[0,1]}$, $\varphi_\lambda(t) = \sqrt{2} \sin(2k\pi t)$ 
if $\lambda = 2k - 1$ and $\varphi_\lambda(t) = \sqrt{2} \cos(2k\pi t)$ if $\lambda = 2k$. 

Then, 

$$\forall t \in [0, 1], \quad \phi_m(t) = \sum_{\lambda \in \Lambda(m)} \varphi_\lambda^2(t) = 1 + 2 \sum_{k=1}^{m} \left(\cos^2(2k\pi t) + \sin^2(2k\pi t)\right) = 2m + 1.$$

Since $D_m = 2m + 1$, it follows that $\|\phi_m\|_\infty = D_m$ and (Reg3) holds with $\Phi = 1$. 

• Barron et al. (1999) (Lemma 7.13) proved that with piecewise polynomials on a 
regular partition of $[0, 1]$ with degree not larger than $r$ on each element of this 
partition, 

$$\|\phi_m\|_\infty \le (r + 1)(2r + 1) D_m.$$

The resulting constant $\Phi = (r + 1)(2r + 1)$ is therefore independent from $m$. 

• Haar basis: For any positive integer $j$, we introduce $\Lambda(j) = \{(j,k) \mid 0 \le k \le 2^j - 1\}$. 
Furthermore, set $\varphi = \mathbb{1}_{[0,1/2)} - \mathbb{1}_{[1/2,1]}$ and, for any $\lambda = (j, k)$, let us define $\varphi_{j,k}(t) = 
2^{j/2} \varphi(2^j t - k)$ on $[0, 1]$. For a positive integer $m \in \mathcal{M}_n$, let us consider $S_m$ as 
the linear space spanned by $\{\varphi_\lambda\}_{\lambda \in \bigcup_{j \le m} \Lambda(j)}$. Then, it can be seen that 

$$\|\phi_m\|_\infty = D_m,$$

since for each $j$, there is only one $0 \le k \le 2^j - 1$ which contributes to the sum in 
$\phi_m$. 

For more general wavelet bases, an upper bound, uniform with respect to $m$, can 
be established (see Birgé and Massart, 1997, for instance). 
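
The first two examples can be checked numerically by evaluating $\phi_m = \sum_\lambda \varphi_\lambda^2$ on a grid. This is a minimal sketch; the helper names are illustrative, and the supremum is only approximated on a finite grid of $[0, 1]$.

```python
import numpy as np


def trig_basis(m, t):
    """Rows are phi_0, ..., phi_{2m} evaluated at the points t (1-D array in [0, 1])."""
    t = np.asarray(t, dtype=float)
    rows = [np.ones_like(t)]
    for k in range(1, m + 1):
        rows.append(np.sqrt(2) * np.sin(2 * k * np.pi * t))   # lambda = 2k - 1
        rows.append(np.sqrt(2) * np.cos(2 * k * np.pi * t))   # lambda = 2k
    return np.vstack(rows)


def regular_histogram_basis(D, t):
    """Rows are 1_{I_lambda} / sqrt(|I_lambda|) for the D regular bins of [0, 1]."""
    t = np.asarray(t, dtype=float)
    idx = np.minimum((t * D).astype(int), D - 1)
    values = np.zeros((D, t.size))
    values[idx, np.arange(t.size)] = np.sqrt(D)
    return values


def phi_m(basis_values):
    """phi_m(t) = sum over lambda of phi_lambda(t)^2, evaluated on the grid."""
    return np.sum(basis_values ** 2, axis=0)


grid = np.linspace(0.0, 1.0, 2001)
for m in (1, 3, 10):
    # Sup-norm on the grid versus the dimension D_m = 2m + 1
    print(m, phi_m(trig_basis(m, grid)).max(), 2 * m + 1)
for D in (1, 4, 16):
    # Sup-norm on the grid versus the dimension D_m = D
    print(D, phi_m(regular_histogram_basis(D, grid)).max(), D)
```

In both cases $\phi_m$ is constant and equal to the dimension, which corresponds to (Reg3) with $\Phi = 1$.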

5 Adaptivity 

In this section, the theorems of Section 4.2 are applied to derive several adaptivity 
results in the minimax sense with respect to Hölder as well as Besov functional spaces. 
5.1 Adaptivity in the minimax sense 



Let us assume that $s$ belongs to a set of functions $\mathcal{T}(\theta)$, indexed by a parameter $\theta \in \Theta$, 
and define an estimator $\widehat{s}$ of $s$. 

Informally, an estimator $\widehat{s}$ is said to be adaptive for $\theta$ if, without knowing $\theta$, it "works as well 
as" any estimator which would exploit this knowledge. 

Definition 5.1. An estimator $\widehat{s}$ is said to be adaptive for $\theta$ if its risk is nearly the same 
as the minimax risk with respect to $\mathcal{T}(\theta)$, that is if there exists $C \ge 1$ satisfying: 

$$\inf_{\tilde{s}} \sup_{s \in \mathcal{T}(\theta)} \mathbb{E}\left[\|s - \tilde{s}\|^2\right] \le \sup_{s \in \mathcal{T}(\theta)} \mathbb{E}\left[\|s - \widehat{s}\|^2\right] \le C \inf_{\tilde{s}} \sup_{s \in \mathcal{T}(\theta)} \mathbb{E}\left[\|s - \tilde{s}\|^2\right],$$

where the infimum is taken over all possible estimators $\tilde{s}$. 

Furthermore, if this property holds for every parameter $\theta$ in a set $\Theta$, then $\widehat{s}$ is said to be 
adaptive in the minimax sense with respect to the family $\{\mathcal{T}(\theta)\}_{\theta \in \Theta}$. 

Interested readers are referred to Barron et al. (1999) for a unified presentation of 
various notions of adaptivity. 

Remark 3. Very often, $C \ge 1$ depends on the unknown parameter $\theta$, but neither on $s$ 
nor on $n$. 



5.2 Description of the collections of models 

Since such optimality results depend on the approximation properties of the considered 
models, three different model collections are described in the following, each one being 
defined from a specific family of vectors $\{\varphi_\lambda\}_{\lambda \in \Lambda_n}$. 

5.2.1 Piecewise constant functions (Pc) 

For a given partition of $[0, 1]$ in $D$ regular intervals $(I_\lambda)_{\lambda \in \Lambda(m)}$ of length $1/D$ and $m \in \mathcal{M}_n$, 
let us define the model 

$$S_m = \left\{ t \;\middle|\; t = \sum_{\lambda \in \Lambda(m)} a_\lambda \varphi_\lambda, \; (a_\lambda)_\lambda \in \mathbb{R} \right\},$$

where $\varphi_\lambda = \mathbb{1}_{I_\lambda}/\sqrt{|I_\lambda|}$ and $|I_\lambda|$ denotes the length of $I_\lambda$. $S_m$ is the vector space of 
dimension $D_m = D$ spanned by the orthonormal family $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$, made of all 
piecewise constant functions defined on the partition $I = (I_1, \ldots, I_{D_m})$. 
Thus, with each index $m \in \mathcal{M}_n$, we associate the linear space $S_m$ of piecewise constant 
functions defined on a regular partition of $[0, 1]$ in $D_m$ intervals of length $1/D_m$. Moreover, 
let $N_n = \max_{m \in \mathcal{M}_n} D_m$ be the maximal dimension of a model belonging to the collection. 
5.2.2 Piecewise dyadic polynomials (Pp) 

Set $\mathcal{M}_n = \{0, \ldots, J_n\}$ and for any $m \in \mathcal{M}_n$, $S_m$ denotes the linear space of functions 

$$t = \sum_{k=0}^{2^m - 1} P_k \, \mathbb{1}_{[k 2^{-m}, (k+1) 2^{-m})},$$

where the $P_k$'s denote polynomials of degree less than $r$. The dimension of $S_m$ is 
therefore given by 

$$D_m = r 2^m \quad \text{and} \quad N_n = \max_{m \in \mathcal{M}_n} D_m = r 2^{J_n}.$$

With this collection of models, (Pol) is satisfied since there is at most one model for each 
dimension. 

5.2.3 Trigonometric polynomials (Tp) 

Set $\mathcal{M}_n = \{0, \ldots, J_n\}$, where $J_n$ is a positive integer. For any $m \in \mathcal{M}_n$, let $\Lambda(m) = 
\{0, \ldots, 2m\}$ denote a set of indices such that $\varphi_0 = \mathbb{1}_{[0,1]}$, $\varphi_\lambda(t) = \sqrt{2} \sin(2k\pi t)$ if 
$\lambda = 2k - 1$ and $\varphi_\lambda(t) = \sqrt{2} \cos(2k\pi t)$ if $\lambda = 2k$. 

Then, $S_m$ is the linear space spanned by $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$, of dimension $D_m = 2m + 1$. Any 
$t \in S_m$ can be expressed as 

$$\forall x \in [0, 1], \quad t(x) = a_0 + \sum_{k=1}^{m} \left[ a_k \sqrt{2} \cos(2\pi k x) + b_k \sqrt{2} \sin(2\pi k x) \right],$$

where the $a_k$'s and $b_k$'s belong to $\mathbb{R}$. 

Moreover, $J_n$ and $N_n$ are related by $N_n = 2J_n + 1$. 

5.3 Hölder functional space 

The purpose is to show that the Lpo-based approach enjoys some adaptivity when $s$ 
belongs to an unknown Hölder space $\mathcal{H}(L, \alpha)$ for $L > 0$ and $\alpha \in (0, 1]$. Let us recall that 
a function $f : [0, 1] \to \mathbb{R}$ belongs to $\mathcal{H}(L, \alpha)$ with $L > 0$ and $0 < \alpha \le 1$ if 

$$\forall x, y \in [0,1], \quad |f(x) - f(y)| \le L |x - y|^\alpha.$$

For an extensive study of functional spaces, see DeVore and Lorentz (1993). 

In order to achieve this goal, $s$ is approximated by piecewise constant functions, using 
the model collection (Pc) described in Section 5.2.1. The histogram estimator built from 
model $S_m$ is defined by 

$$\widehat{s}_m = \sum_{\lambda \in \Lambda(m)} P_n\varphi_\lambda \, \varphi_\lambda = \sum_{\lambda \in \Lambda(m)} \frac{n_\lambda}{n |I_\lambda|} \mathbb{1}_{I_\lambda},$$

where $n_\lambda = \mathrm{Card}\left(\{i \mid X_i \in I_\lambda\}\right)$. 

In the sequel, the assumptions of Theorem 4.1 are checked in order to derive the desired 
adaptivity property. 

• With the collection (Pc), $m \mapsto D_m$ is a one-to-one mapping from $\mathcal{M}_n$ onto 
$\mathcal{D} = \{D_m \mid m \in \mathcal{M}_n\}$, which entails that (Pol) is satisfied since the collection is 
made of only one model for each dimension. 

• Since $\varphi_\lambda = \mathbb{1}_{I_\lambda}/\sqrt{|I_\lambda|}$, 

$$\|\phi_m\|_\infty = \sup_{t \in [0,1]} \sum_{\lambda \in \Lambda(m)} \varphi_\lambda^2(t) = \max_{\lambda \in \Lambda(m)} \frac{1}{|I_\lambda|} = D_m.$$

Thus, (Reg) amounts to requiring that 

$$\max_m D_m = N_n \le \Phi\, n/(\log n)^2,$$

which means that on average, there are at least $(\log n)^2/\Phi$ points in each 
interval of any partition we consider. 

• We therefore assume that (Reg), (Ad) and (Ran) hold. 

As for the problem of density estimation on $[0, 1]$ when $s$ belongs to some Hölder 
space, it has been known since the early 80s, thanks to Ibragimov and Khas'minskij (1981), 
that the minimax rate with respect to $\mathcal{H}(L, \alpha)$ 
for the quadratic risk is of order $L^{\frac{2}{2\alpha+1}} n^{-\frac{2\alpha}{2\alpha+1}}$, for any $L > 0$ and $\alpha > 0$. 

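The following result establishes that, applied to the collection of models (Pc), the Lpo-based 
procedure yields an estimator of the density on $[0, 1]$ that is adaptive in the minimax sense. 

Theorem 5.1. Let us assume that (Reg), (Ad) and (Ran) hold and that the collection 
of models is the one denoted by (Pc). Furthermore, assume that the target density $s \in 
\mathcal{H}(L, \alpha)$ for $L > 0$ and $\alpha \in (0, 1]$. Then, 

$$\sup_{s \in \mathcal{H}(L, \alpha)} \mathbb{E}\left[\|s - \widehat{s}_{\widehat{m}}\|^2\right] \le K_\alpha L^{\frac{2}{2\alpha+1}} n^{-\frac{2\alpha}{2\alpha+1}} + O\left(\frac{1}{n}\right), \tag{10}$$

for a given constant $K_\alpha$ independent from $n$ and $s$, where $\widehat{m}$ derives from the Lpo risk 
minimization over $\mathcal{M}_n$. 

Since the minimax risk is of order $L^{\frac{2}{2\alpha+1}} n^{-\frac{2\alpha}{2\alpha+1}}$, $\widehat{s}_{\widehat{m}}$ is adaptive in the minimax sense 
with respect to $\{\mathcal{H}(L, \alpha)\}_{L > 0, \alpha \in (0,1]}$. 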
Remark 4. This result still holds with any polynomial collection of models satisfying 
the requirements of Theorem 4.1, and including models with dimension of the order of 
$L^{\frac{2}{1+2\alpha}} n^{\frac{1}{1+2\alpha}}$. 

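Proof. The idea is simply to use Theorem 4.1 and derive the upper bound from 

$$\mathbb{E}\left[\|s - \widehat{s}_m\|^2\right] = \|s - s_m\|^2 + \mathbb{E}\left[\|s_m - \widehat{s}_m\|^2\right].$$

For the bias term, we have 

$$\|s - s_m\|^2 = \sum_{\lambda \in \Lambda(m)} \int_{I_\lambda} \left( \frac{1}{|I_\lambda|} \int_{I_\lambda} \left[s(t) - s(x)\right] dx \right)^2 dt$$

$$\le \sum_{\lambda \in \Lambda(m)} L^2 D_m^2 \int_{I_\lambda} \left( \int_{I_\lambda} |t - x|^\alpha \, dx \right)^2 dt \quad (s \in \mathcal{H}(L, \alpha))$$

$$\le C_\alpha L^2 D_m^{-2\alpha} \quad \text{(after integration)},$$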

1 



where C^, = 4 (q + 2) (1 + af {2a + 3) 
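On the other hand, 

$$\mathbb{E}\left[\|s_m - \widehat{s}_m\|^2\right] \le \frac{V_m}{n} \le \frac{\|\phi_m\|_\infty}{n} = \frac{\sup_{x \in [0,1]} \sum_{\lambda \in \Lambda(m)} \varphi_\lambda^2(x)}{n} = \frac{D_m}{n}.$$

Hence, under the same assumptions as Theorem 4.1, there exist $C \ge 1$ and 
$K > 0$ such that 

$$\mathbb{E}\left[\|s - \widehat{s}_{\widehat{m}}\|^2\right] \le C \inf_{m \in \mathcal{M}_n} \left\{ C_\alpha L^2 D_m^{-2\alpha} + \frac{D_m}{n} \right\} + \frac{K}{n}.$$

Now, let us define the sequence $\{D_{m_n}\}_n$ such that for each $n$, 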

1 1 1 1 1 

Then, we derive that there exists $K' > 0$ such that 



+ 



n 



inf E 

mGMn 



S Sr. 



hence the expected result. 



□ 
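
The trade-off behind this proof can be illustrated numerically: minimizing a bound of the form $C_\alpha L^2 D^{-2\alpha} + D/n$ over the admissible dimensions gives a value of the same order as $L^{2/(2\alpha+1)} n^{-2\alpha/(2\alpha+1)}$. The sketch below makes two simplifying assumptions that are not taken from the paper: $C_\alpha = 1$ and $\Phi = 1$ in (Reg), so that $D \le n/(\log n)^2$. Function names are illustrative.

```python
import math


def best_dimension(L, alpha, n, C_alpha=1.0):
    """Minimize C_alpha * L^2 * D^(-2 alpha) + D / n over 1 <= D <= n / (log n)^2.

    Illustrative assumptions: C_alpha = 1 by default and Phi = 1 in (Reg).
    """
    max_dim = max(1, int(n / math.log(n) ** 2))

    def bound(D):
        # Bias bound plus variance bound for the model of dimension D
        return C_alpha * L ** 2 * D ** (-2 * alpha) + D / n

    D_best = min(range(1, max_dim + 1), key=bound)
    return D_best, bound(D_best)


def minimax_rate(L, alpha, n):
    """L^(2 / (2 alpha + 1)) * n^(-2 alpha / (2 alpha + 1))."""
    return L ** (2 / (2 * alpha + 1)) * n ** (-2 * alpha / (2 * alpha + 1))


for n in (10 ** 3, 10 ** 4, 10 ** 5):
    D_best, value = best_dimension(L=1.0, alpha=1.0, n=n)
    # The ratio between the minimized bound and the rate stays bounded as n grows
    print(n, D_best, value / minimax_rate(1.0, 1.0, n))
```

The selected dimension grows like $n^{1/(2\alpha+1)}$, while the ratio printed in the last column remains of constant order.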



5.4 Besov functional spaces 

The present section aims at deriving adaptivity in the minimax sense with respect to Besov 
spaces. This goal is reached thanks to the results of Section 4.2 as well. 
5.4.1 Overview of Besov spaces 

Let us start by briefly recalling what Besov spaces and balls consist of (see 
DeVore and Lorentz, 1993, for an extensive presentation on this matter). 

For $\alpha > 0$ and $0 < p \le +\infty$, a function $f$ in $L^p([0, 1])$ belongs to the Besov space 
$B^\alpha_{\infty,p} = B^\alpha_\infty(L^p([0, 1]))$ if $|f|_{B^\alpha_{\infty,p}} < +\infty$, where 

$$|f|_{B^\alpha_{\infty,p}} := \sup_{t > 0} \left\{ t^{-\alpha} \omega_r(f, t)_p \right\}, \quad r = [\alpha] + 1,$$

with 

$$\omega_r(f, t)_p := \sup_{|h| \le t} \left\| \Delta_h^r(f, \cdot) \right\|_p \quad \text{and} \quad \Delta_h^r(f, x) := \sum_{k=0}^{r} \binom{r}{k} (-1)^{r-k} f(x + kh).$$

$|\cdot|_{B^\alpha_{\infty,p}}$ is a semi-norm, while the metric is provided by the Besov norm $\|f\|_{B^\alpha_{\infty,p}} := \|f\|_p + |f|_{B^\alpha_{\infty,p}}$. 

Moreover, for a given real $R > 0$, let us define the Besov ball of radius $R$ by 

$$B^\alpha_{\infty,p}(R) = \left\{ f \in L^p \mid \|f\|_{B^\alpha_{\infty,p}} \le R \right\}.$$

In the sequel, the particular case where $p = 2$, that is $B^\alpha_{\infty,2}$ for $\alpha > 0$, is considered. 

5.4.2 Piecewise and trigonometric polynomials 

In the same way as in Section 5.3, the strategy consists in deriving adaptivity results from 
the oracle inequalities of Section 4.2. Adaptivity heavily relies on the involved model 
collection through its approximation properties. 

The following results therefore state adaptivity in the minimax sense for both the (Pp) and 
(Tp) collections, each with respect to different Besov spaces. 



The next theorem establishes adaptivity with respect to Besov balls $B^\alpha_{\infty,2}(R)$ for $0 < \alpha < r$, 
where $r$ denotes the smallest integer larger than the degree of polynomials in (Pp). 

Theorem 5.2. Let us consider the collection of models (Pp) made of piecewise polynomials 
of degree less than $r$ and assume that (Reg), (Reg3), (Ad), and (Ran) hold. 
Then for $R > 0$ and $0 < \alpha < r$, 

$$\sup_{s \in B^\alpha_{\infty,2}(R)} \mathbb{E}\left[\|s - \widehat{s}_{\widehat{m}}\|^2\right] \le C_\alpha R^{\frac{2}{1+2\alpha}} n^{-\frac{2\alpha}{1+2\alpha}} + O\left(\frac{1}{n}\right), \tag{11}$$

where $C_\alpha$ denotes a given constant independent from $n$ and $s$. 
Proof. The proof follows the same strategy as that of Theorem 5.1, in that it essentially 
relies on approximation properties of the models in (Pp). 



If $S_m$ denotes a model of dyadic piecewise polynomials of degree less than $r$ on each 
one of the $2^m$ regular dyadic intervals, the result on page 359 of 
DeVore and Lorentz (1993) states that, provided $r > \alpha$, 



inf < Ka,r\s\B'^ ADm) 



2a 



for a positive constant K^^r- 
Since s G 2i^)' comes 



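As for the variance term, 

$$\mathbb{E}\left[\|s_m - \widehat{s}_m\|^2\right] \le \frac{\|\phi_m\|_\infty}{n} \le \frac{\Phi D_m}{n} \quad \text{(by (Reg3))}.$$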

Under {Reg) , (Reg3) , (Ad) and (Ran) we apply Theorem 4.2 to derive 



E 



R^ + 



D 



m 

n 



K 

+ - 

n 



< r < inf 

y ' meMn 

where K'^ j. is a positive constant. 

The conclusion results from the same calculation as in the proof of Theorem 5.1 with 

1 1 1 1 1 



□ 



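Unlike the previous result, we now turn to Besov balls $B^\alpha_{\infty,2}(R)$ for any value of $\alpha > 0$, 
which is made possible by the use of trigonometric polynomials, that is (Tp). 

Theorem 5.3. Let us consider the collection (Tp) made of trigonometric polynomials 
and assume that (Reg), (Reg2), (Reg3), (Ad) and (Ran) hold. 
Then for $R > 0$ and $\alpha > 0$, 

$$\sup_{s \in B^\alpha_{\infty,2}(R)} \mathbb{E}\left[\|s - \widehat{s}_{\widehat{m}}\|^2\right] \le C'_\alpha R^{\frac{2}{1+2\alpha}} n^{-\frac{2\alpha}{1+2\alpha}} + O\left(\frac{1}{n}\right), \tag{12}$$

for a given constant $C'_\alpha$ independent from $n$ and $s$. 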

Proof. The same scheme of proof is used, except that we need an approximation result 
applying to trigonometric polynomials, which is provided on page 205 of the book by 
DeVore and Lorentz (1993). Indeed, considering models in (Tp), for any $\alpha > 0$ it comes 



inf ||s — m|| < Kn\s 

U&Sm 



a\^\B° 



{Dr, 



for a constant $K_\alpha > 0$. Assumption (Reg3) then allows us to conclude as in the previous theorem. 

□ 
6 Conclusion 



6.1 Summary of main contributions 

In this work, CV has been studied as a model selection procedure in the density estimation 
setup. First, closed-form expressions have been derived for the leave-p-out (Lpo) estimator 
of the risk of projection estimators. These expressions drastically reduce computation time, 
which is a crucial issue, and also make V-fold cross-validation (VFCV) unnecessary in this setting, 
since it is both more variable and more expensive to carry out than Lpo. As an estimator of the risk, 
closed-form expressions for the bias and variance of the Lpo estimate are provided as well. 

Second, the Lpo estimator is embedded in the framework of model selection via penalized 
criteria, which sheds new light on the choice of $p$, the cardinality of the 
test set, with respect to the amount of penalization. It is shown that a wide range of 
penalization is available, from the smallest one when $p = 1$ to penalties of the same order 
as BIC. Loo is inappropriate to recover the true model, as well as to perform 
model selection with too rich collections of models, especially exponential ones. The 
conclusions drawn here are all consistent with previous empirical results such as those of 
Breiman and Spector (1992) for instance. 

Finally, two oracle inequalities are established in density estimation with polynomial 
collections of models. These optimality results hold provided the ratio $0 < p/n < 1$ is neither 
too small, nor too large. To the best of our knowledge, these oracle inequalities are the 
first non-asymptotic results applying to Lpo in the density estimation setting. Furthermore, 
with an appropriate choice of model collections, it is shown that the CV procedure leads 
to estimators that are adaptive in the minimax sense with respect to Hölder as well as 
Besov spaces. 

6.2 Discussion 

On the one hand, the closed-form expressions established in the present paper address the 
crucial issue of resampling procedures, that is their high computational complexity. Moreover, 
the broad class of projection estimators for which such formulas are obtained allows 
extensive applications of Lpo to wavelets, piecewise polynomials, and so on. 

Besides, empirical evidence has been given in Celisse (2008) of an intricate relationship 
between the behaviour of Lpo with respect to $p$ as a model selection procedure and the size 
of the polynomial collection of models. In particular, it has been shown that Loo may 
suffer from overfitting with a polynomial collection of models provided the latter is "large 
enough". The analysis of this relationship deserves further investigation in order to 
describe the precise settings in which such troubles can occur. For instance, specifying the 
minimal value of $p$ from which overfitting can be avoided seems highly desirable. 
7 Proofs 

7.1 Closed-form Lpo estimator 
7.1.1 Proof of Lemma 3.1 

The first remark is that for each $e \in \mathcal{E}_p$, we have $\forall t \in [0, 1]$, 

- 1 1 " 

P jee A Pj=l A 

1 " 

iee j=l j'ee A 

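Then, the lemma follows from the following combinatorial results. 
Lemma 7.1. For any pairwise distinct $i, j, k \in \{1, \ldots, n\}$, 

$$\sum_{e \in \mathcal{E}_p} \mathbb{1}_{(i \in e)} \mathbb{1}_{(j \in \bar{e})} \mathbb{1}_{(k \in \bar{e})} = \binom{n-3}{p-1} \quad \text{and} \quad \sum_{e \in \mathcal{E}_p} \mathbb{1}_{(i \in e)} \mathbb{1}_{(j \in \bar{e})} = \binom{n-2}{p-1},$$

where the sum is computed over the resamples: indices $i$, $j$, and $k$ are kept fixed. 

Proof. $\sum_{e \in \mathcal{E}_p} \mathbb{1}_{(j \in \bar{e})}$ can be interpreted as the number of subsets of $\{1, \ldots, n\}$ of size $p$ 
(denoted by $e$) which do not contain $j$, since $j \in \bar{e}$. Thus, it is the number of possible 
choices of $p$ unordered and distinct elements among $n - 1$. 

The other equalities follow from a similar argument. □ 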
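
Counting arguments of this kind are easy to confirm by exhaustive enumeration for small $n$ and $p$. The sketch below enumerates all resamples $e$ of size $p$ and compares the counts with binomial coefficients; the function name `count_resamples` is illustrative, and indices run from $0$ to $n-1$ rather than from $1$ to $n$.

```python
from itertools import combinations
from math import comb


def count_resamples(n, p, inside=(), outside=()):
    """Number of subsets e of {0, ..., n-1} of size p containing `inside` and avoiding `outside`."""
    return sum(
        1
        for e in combinations(range(n), p)
        if all(i in e for i in inside) and all(j not in e for j in outside)
    )


n, p = 8, 3
# j not in e: choose the p elements of e among the n - 1 remaining indices
print(count_resamples(n, p, outside=(1,)), comb(n - 1, p))
# i in e and j not in e: choose the other p - 1 elements among n - 2 indices
print(count_resamples(n, p, inside=(0,), outside=(1,)), comb(n - 2, p - 1))
# i in e, j and k not in e: choose the other p - 1 elements among n - 3 indices
print(count_resamples(n, p, inside=(0,), outside=(1, 2)), comb(n - 3, p - 1))
```

Each printed pair coincides, which is the enumeration counterpart of the argument "the number of possible choices of the remaining elements among the remaining indices".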

7.2 Moments calculations 
7.2.1 Proof of Proposition 3.2 

The expectation is a straightforward consequence of (4). 

The variance calculation is not difficult, but very technical: only the main steps of 
this proof are given. 

First, let us define $A_\lambda = \sum_{j=1}^{n} \varphi_\lambda^2(X_j)$ and $B_\lambda = \sum_{j \neq k} \varphi_\lambda(X_j) \varphi_\lambda(X_k)$. Set $\alpha = n - 1$ 
and $\beta = -(n - p + 1)$, such that 

$$n(n-1)(n-p) \widehat{R}_p(m) = \sum_{\lambda \in \Lambda(m)} \left(\alpha A_\lambda + \beta B_\lambda\right).$$
Then, 



A 

+ Y («'^A^A' + P^BxBx' + 2aPAx B, 



After some calculation, the different terms are respectively equal to 

A A 
A A 

Y AxBx = Y [ '^tiP^P^x + t2P^l [P^xY 



E 



E Y ^A^A' = n 

Xjty 



IE E^aW -E^v^: 



+ h 



2\2 



-EM) 



E Y BxBx' = 2h E (P^xV^yf + 

A^A' At^A' 



+ 



E(^^x{X)Pipx^ -YP^liP^xf 

E(^^Af 1 -E(W 

E ( E ^A w 1 E (^^A')' - E ( E ^i(^) (^^a)' 



E Y AxBy = 2ti P^l^X'P^X' + t2 
A^A' A^A' 

On the other hand. 



(n{n - l){n - p)E Rp{m) 



2 2 

n a 



Combining these two expressions yields the variance after some simplifications. 
7.2.2 Proof of Corollary 3.4 

For every model $m \in \mathcal{M}_n$, 

A X 

= ^j;Var(^,(X))-^(P(^,)^ 
A A 

7.3 Theorem 4.1 

At the beginning of this section, several preliminary results are enumerated, which are 
useful in the proof of Theorem 4.1. Then, the main steps of the strategy are briefly exposed, 
and the complete proof of the main result is finally provided. Proofs of preliminary results 
are given in the sections following the proof of Theorem 4.1. 




7.3.1 Preliminaries 



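Notation. First of all, let us define some notation that will be useful in the sequel. 

For every $1 \le p \le n-1$, the Lpo risk estimator associated with the estimator $\widehat{s}_m$ is 
denoted by $\widehat{R}_p(m)$. For every $m$, set 

$$L_p(m) = \mathbb{E}\left[\widehat{R}_p(m)\right],$$

such that $L_p(\widehat{m}) := \mathbb{E}\left[\widehat{R}_p(m)\right]\big|_{m = \widehat{m}}$. For each $m$, $\{\varphi_\lambda\}_{\lambda \in \Lambda(m)}$ denotes an orthonormal 
basis of $S_m$. Moreover, with $\nu_n = P_n - P$ the centered empirical process, we set 

$$\chi^2(m) = \sum_{\lambda \in \Lambda(m)} \nu_n^2(\varphi_\lambda), \quad E_m = \mathbb{E}\left[\chi^2(m)\right], \quad \phi_m = \sum_{\lambda \in \Lambda(m)} \varphi_\lambda^2, \quad V_m = \mathbb{E}\left[\phi_m(X)\right] \quad \text{and} \quad \theta_{n,p} = \frac{2n-p}{(n-1)(n-p)}.$$

Remark 5. $\chi^2(m)$ is not a true $\chi^2$ statistic, but is only somewhat similar to it. 

Two elementary but useful properties are repeatedly used in the sequel: for any $a, b \ge 0$, 

(Roo) $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$, 

(Squ) $2ab \le \eta a^2 + \eta^{-1} b^2$, $\forall \eta > 0$. 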
Preliminary results. Several preliminary results are then provided. They will be 
repeatedly referred to within the proof of Theorem 4.1. 

The first result deals with the relationship between $\widehat{R}_p$ and its expectation for each 
model. 

Lemma 7.2. For any $m \in \mathcal{M}_n$, 

$$L_p(m) - L_p(\widehat{m}) = \frac{n}{n-p} \left[E_m - E_{\widehat{m}}\right] - \left(\|s - s_{\widehat{m}}\|^2 - \|s - s_m\|^2\right),$$

$$\widehat{R}_p(m) - L_p(m) = \theta_{n,p}\, \nu_n(\phi_m) - (1 + \theta_{n,p}) \left[\chi^2(m) - E_m\right] - 2(1 + \theta_{n,p})\, \nu_n(s_m).$$



In Lemma 7.2, we see that $\nu_n(\phi_m)$ appears in the expressions. The next proposition 
makes it possible to upper bound the deviation of this quantity. It is a consequence of Bernstein's 
inequality (see Massart, 2007). 

Proposition 7.1. With the above notation, let $z > 0$ and $C > 0$ be any positive constants 
and for each $m$, let us define $y_m = z + C\, n E_m$. Then, we have 



Moreover if {Ad) holds, we have 



n 



n 



< 2e"^'". 



ym ~r yn 



n 



n 



where $\Sigma_1$ is a positive constant independent from $n$. 



Besides, since $\chi^2(m) = \sum_{\lambda \in \Lambda(m)} \nu_n^2(\varphi_\lambda)$, a handy way to study this $\chi^2$-like statistic is to 
introduce an event of large probability on which we are able to get some control of $\nu_n(\varphi_\lambda)$. 
The event $\Omega_n(\epsilon)$ is therefore introduced: 

0„(e) = Vm G A^„, VA G A(m), \un{ipx)\< ^ 



where $\kappa(t) = 2(t^{-1} + 1/3)$. 

Another use of Bernstein's inequality provides the following lemma. 

Lemma 7.3. Set $\epsilon > 0$ and assume that (Reg), (Reg2) and (Pol) hold. Then, 



Ma > 0, 

2e2 



[l^^(e)]<2n2+^e-^ 



lloo'?(^) 



(log n) 



where r,{t) = ^(f)(^(t')+2t/3) " 
This lemma turns out to be useful in order to assess the concentration of $\chi(m)$ 
around its expectation. The following result may be found in Massart (2007) and is a consequence 
of Talagrand's inequality. 

Proposition 7.2. Set $\epsilon > 0$ and for any $C, z > 0$, $x_m = z + C\, n E_m$. Let us assume 
that (Reg), (Reg2) and (Pol) are fulfilled. Then, 



Mm G Mn, 



V^xMlQ4e) > + ( \/^+ V2||s| 
Furthermore if {Ad) holds, 

3m G TUn I > (1 + e) ( \JnEm + ^2 ||s| 



< e" 



< S2 e-^ 



where $\Sigma_2 > 0$ denotes a positive constant independent from $n$. 

Finally, the term $\nu_n(s_m)$ in Lemma 7.2 has not been dealt with yet. 
The control of this quantity results from the following lemma. 

Lemma 7.4. Set $m, m' \in \mathcal{M}_n$. Then for any $\rho > 0$, 



t 



sup KItt-^] <{! + p)x {m) + (1 + p'')x'{m'). 



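7.3.2 Outline of the strategy 

Let us now outline the strategy. 

Since $\widehat{m} = \mathrm{Argmin}_{m \in \mathcal{M}_n} \widehat{R}_p(m)$, for every $m \in \mathcal{M}_n$ we have $\widehat{R}_p(\widehat{m}) \le \widehat{R}_p(m)$, 
which implies 

$$\widehat{R}_p(\widehat{m}) - L_p(\widehat{m}) \le \widehat{R}_p(m) - L_p(m) + \left[L_p(m) - L_p(\widehat{m})\right]. \tag{13}$$

Then, Lemma 7.2 applied to (13) yields 

$$\|s - s_{\widehat{m}}\|^2 + n\theta_{n,p} E_{\widehat{m}} - (1 + \theta_{n,p}) \chi^2(\widehat{m}) \le \|s - s_m\|^2 + n\theta_{n,p} E_m - (1 + \theta_{n,p}) \chi^2(m) + \theta_{n,p}\, \nu_n(\phi_m - \phi_{\widehat{m}}) + 2(1 + \theta_{n,p})\, \nu_n(s_{\widehat{m}} - s_m). \tag{14}$$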

Main steps 

• In the oracle inequality one has in mind, the left-hand side of the final inequality
is something like $\mathbb{E}\left[\|s - \widehat{s}_{\widehat{m}}\|^2\right]$, which is equal to
$\mathbb{E}\left[\|s - s_{\widehat{m}}\|^2\right] + \mathbb{E}\left[\chi^2(\widehat{m})\right]$ with the present notations.
However, the left-hand side of (14) involves $\mathbb{E}\left[E_{\widehat{m}}\right]$ instead.
In order to relate $\mathbb{E}\left[E_{\widehat{m}}\right]$ to $\mathbb{E}\left[\chi^2(\widehat{m})\right]$, the discrepancy $E_m - \chi^2(m)$ will
be uniformly controlled over $\mathcal{M}_n$ thanks to both Lemma 7.3 and Proposition 7.2.

• An upper bound of $\nu_n(\phi_{\widehat{m}} - \phi_m)$ is obtained thanks to Proposition 7.1, so that
$\nu_n(\phi_{\widehat{m}})$ is related to $E_{\widehat{m}}$.

• Finally, $\nu_n(s_{\widehat{m}} - s_m)$ may be upper bounded thanks to Lemma 7.4, independently
from $E_{\widehat{m}}$, and will therefore be dealt with later.

• Combining these different steps, the desired inequality is derived except on a set of
small probability (18). The conclusion results from the following lemma:

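Lemma 7.5. Let $X$ and $Y$ be two random variables such that $\forall z > 0$,
$\mathbb{P}(X > Y + K_1 z + K_2) \le \Sigma e^{-z}$, where $K_1, K_2, \Sigma > 0$. Then, we have

\[
  \mathbb{E}X \le \mathbb{E}Y + K_1 \Sigma + K_2.
\]

Proof. With $Z = X - Y - K_2$, one gets $\mathbb{P}(Z > K_1 z) \le \Sigma e^{-z}$. Then,

\[
  \mathbb{E}Z \le \mathbb{E}\left[ \int_0^{+\infty} \mathbf{1}_{(t \le Z)}\, dt \right]
  = \int_0^{+\infty} \mathbb{P}\left[ t \le Z \right] dt
  \le K_1 \int_0^{+\infty} \Sigma e^{-z}\, dz = K_1 \Sigma.
\]
□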
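As an illustration of how Lemma 7.5 is meant to be used, the following sketch spells out the substitution that turns a deviation bound such as (18) into a bound in expectation such as (19). The identification of $X$, $Y$, $K_1$ and $K_2$ is inferred from the remainder terms $\theta_{n,p}(Az + B)$ in (18) and $\theta_{n,p}[A(\Sigma_1 + \Sigma_2) + B]$ in (19); it is an assumption about the intended reading rather than a statement taken from the text.

```latex
% X: left-hand side of (18); Y: right-hand side of (18) without its remainder term.
% Assumed identification: K_1 = \theta_{n,p} A, K_2 = \theta_{n,p} B, \Sigma = \Sigma_1 + \Sigma_2.
\[
  \forall z > 0,\quad
  \mathbb{P}\left[ X > Y + \theta_{n,p} A\, z + \theta_{n,p} B \right]
  \le (\Sigma_1 + \Sigma_2)\, e^{-z}
  \quad\Longrightarrow\quad
  \mathbb{E}X \le \mathbb{E}Y + \theta_{n,p}\left[ A(\Sigma_1 + \Sigma_2) + B \right].
\]
```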




7.3.3 Proof of Theorem 4.1 

Proof. According to the previous remarks, Proposition 7.1 is applied to $\nu_n(\phi_m)$.
The successive use of (Reg), (Squ) with any $\eta > 0$, and (Roo) provides






n 

Moreover, note that

\[
  V_m = \sum_{\lambda \in \Lambda(m)} \mathbb{E}\left[ \varphi_\lambda^2(X) \right] = nE_m + \|s_m\|^2 \le nE_m + \|s\|^2.
\]

Hence with $y_m = z + C\,nE_m$,



Similarly, (Reg) entails that 



Vm < -irnEm + -Z, 

on 6 6 
which leads to



< nE„ 



(J 



+ 2z 



nEf, 



C 

o 



+ 2r/$||sf , 



except on an event of probability less than $\Sigma_1 e^{-z}$.

Set e" > and let us choose r/ = e'7(3$) and C = 2e" / [r]'^ + $/3). Then it comes that 



Wn {(pm - 4>m)\ < nEme" + nE.ff,e" + 



1 3 

3^7 



+ 2^||.|P 



Plugging this into (14) provides 



||s - Sail + nBn^^iX - e")Eff, - (1 + 9n,p)x'^ifh) 
< \\s - Smf + nen,p{l + e")Em - (1 + en.p)x^{m) + 2(1 + &„,p)zy„ {s 

3 





" 1 3 " 


^2z$ 


.3 ^ 



+ 2^\\s\? 



(15) 



except on an event of probability less than $\Sigma_1 e^{-z}$.



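On the other hand, Proposition 7.2 implies that for a given $\epsilon > 0$, except on a set of
probability less than $\Sigma_2 e^{-z}$, we have

\[
  \forall m \in \mathcal{M}_n,\quad
  \sqrt{n}\,\chi(m)\,\mathbf{1}_{\Omega_n(\epsilon)} \le (1+\epsilon)\left( \sqrt{nE_m} + \sqrt{2\|s\|_\infty x_m} \right).
\]

Using $x_m = z + C'\,nE_m$ and (Roo), we get

\[
  \sqrt{n}\,\chi(m)\,\mathbf{1}_{\Omega_n(\epsilon)} \le (1+\epsilon)\left( \sqrt{nE_m}\left[ 1 + \sqrt{2\|s\|_\infty C'} \right] + \sqrt{2\|s\|_\infty z} \right),
\]

which in turn, combined with (Squ), implies for any $x > 0$

\[
  \chi^2(m)\,\mathbf{1}_{\Omega_n(\epsilon)} \le (1+\epsilon)^2 \left[ (1+x)\,E_m \left[ 1 + \sqrt{2\|s\|_\infty C'} \right]^2 + (1+x^{-1})\,\frac{2\|s\|_\infty}{n}\, z \right]. \tag{16}
\]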



It holds in particular for the choices $x = \epsilon$ and $C' = \left( 1 - \sqrt{1+\epsilon} \right)^2 / \left( 2\|s\|_\infty \right)$, which results
in



1-e" 



1-e" 2\\s\ 



e) n 
with probability larger than $1 - \Sigma_2 e^{-z}$.
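The effect of these particular choices can be checked by hand. The sketch below assumes that (16) has the form $\chi^2(m)\mathbf{1}_{\Omega_n(\epsilon)} \le (1+\epsilon)^2 [ (1+x) E_m (1 + \sqrt{2\|s\|_\infty C'})^2 + (1+x^{-1}) (2\|s\|_\infty / n) z ]$ and that $C'$ carries a square, that is $C' = (\sqrt{1+\epsilon} - 1)^2 / (2\|s\|_\infty)$; both are readings of the displays above, not quotations.

```latex
% Assumptions: x = \epsilon and C' = (\sqrt{1+\epsilon} - 1)^2 / (2\|s\|_\infty).
\begin{align*}
  1 + \sqrt{2\|s\|_\infty C'} &= 1 + \left( \sqrt{1+\epsilon} - 1 \right) = \sqrt{1+\epsilon},\\
  (1+\epsilon)^2 (1+x) \left( 1 + \sqrt{2\|s\|_\infty C'} \right)^2
    &= (1+\epsilon)^2 (1+\epsilon)(1+\epsilon) = (1+\epsilon)^4,\\
  (1+\epsilon)^2 \left( 1 + x^{-1} \right)
    &= \frac{(1+\epsilon)^3}{\epsilon}.
\end{align*}
% Hence, on the event where (16) holds,
\[
  \chi^2(m)\,\mathbf{1}_{\Omega_n(\epsilon)}
  \le (1+\epsilon)^4 E_m + \frac{(1+\epsilon)^3}{\epsilon}\,\frac{2\|s\|_\infty}{n}\, z .
\]
```

Dividing by $(1+\epsilon)^4$ explains why factors of the form $(1+\epsilon)^{-4}$ and $[\epsilon(1+\epsilon)]^{-1}$ appear in the next display.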

From the above result and (15), it follows that on $\Omega_n(\epsilon)$, with probability larger than
$1 - (\Sigma_1 + \Sigma_2)\, e^{-z}$, we have



|s — II + I nO. 



1-e" 



"'P(l + e)4 

< lis - Smf + nen,p(l + e")Em - (1 + en,p)xHrn) +2(1 + 

1-e" 



e(l+e) 



2||s|L +2$ 



1 3_ 
3^7 



e" 2 
~t~ 2^77, p-— ||s|| 



Now for any $\epsilon > 0$, we define $\epsilon' > 0$ such that $\sqrt{1 - \epsilon'} = (1+\epsilon)^{-4}$ and let us take $\epsilon''$
satisfying $1 - \epsilon'' = \sqrt{1 - \epsilon'}$. Then, the above inequality becomes

||s - Smll^ + [n6n,p (1 - e') - (1 + On,p)] X^ifh) 



< lis — Smil + nO^ 



n,p 



n,p)^n [Sfn S 

1 



+ 



3 



+ 26n.'. 



i-VT 



^ II l|2 



(17) 
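The role of $\epsilon'$ and $\epsilon''$ can be traced through the coefficients of (17). The sketch below assumes, as the surrounding displays suggest, that before this step $\chi^2(\widehat{m})$ carries the factor $n\theta_{n,p}(1-\epsilon'')/(1+\epsilon)^4$ and $E_m$ carries the factor $n\theta_{n,p}(1+\epsilon'')$.

```latex
% Assumptions: \sqrt{1-\epsilon'} = (1+\epsilon)^{-4} and 1-\epsilon'' = \sqrt{1-\epsilon'}.
\begin{align*}
  \frac{1-\epsilon''}{(1+\epsilon)^4}
    &= \sqrt{1-\epsilon'} \cdot \sqrt{1-\epsilon'} = 1-\epsilon',\\
  1+\epsilon'' &= 2 - (1-\epsilon'') = 2 - \sqrt{1-\epsilon'} .
\end{align*}
% Consequently, the coefficient of \chi^2(\widehat{m}) on the left-hand side becomes
%   n\theta_{n,p}(1-\epsilon') - (1+\theta_{n,p}),
% and the coefficient of E_m on the right-hand side becomes
%   n\theta_{n,p}\,[\,2-\sqrt{1-\epsilon'}\,].
```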



The next step consists in deriving an upper bound for $\nu_n(s_{\widehat{m}} - s_m)$. It results
from the following inequalities and Lemma 7.4. Indeed, we have



2l^n (Sm Sfo) ^ 2,lJi-i 



\Sfh - Sm\\ < 2 sup l/„ 



t 



I ^rh I 



Moreover, $\|s_{\widehat{m}} - s_m\| \le \|s_{\widehat{m}} - s\| + \|s - s_m\|$ and a double use of (Squ) give for any $x > 0$:



2fn (Sm - Sm) < {I + x) SUp Z^n TTTTT + 



t 



m 

Finally, Lemma 7.4 yields that for any $\rho > 0$, we have



2 + x 



l|2^2 



2Un {Sfn - Srn) < (1 + x) [ (1 + p)x^(^) + (1 + p'^i^) ] 



+ 



2 + x 



l|2^2 
X 



With $x = \epsilon'$ and $\rho = \epsilon'(1+\epsilon')^{-1}$, we get

1 + 2e' 

2iyn (sm - Sm) < (l + 2e') x'(m) + (1 + e') —x^m) 



+ 



2 + e' 



l|2^1ll 

5m S|| + / IIS?; 
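The constants in front of $\chi^2(\widehat{m})$ and $\chi^2(m)$ in the display above follow from a direct substitution in the bound of Lemma 7.4. As a quick check, assuming $x = \epsilon'$ and $\rho = \epsilon'/(1+\epsilon')$ (the value of $\rho$ that reproduces the stated coefficients):

```latex
\begin{align*}
  (1+x)(1+\rho) &= (1+\epsilon')\,\frac{1+2\epsilon'}{1+\epsilon'} = 1+2\epsilon',\\
  (1+x)\left( 1+\rho^{-1} \right)
    &= (1+\epsilon')\left( 1+\frac{1+\epsilon'}{\epsilon'} \right)
     = (1+\epsilon')\,\frac{1+2\epsilon'}{\epsilon'} .
\end{align*}
```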
Plugging this into (17) yields:

On the event $\Omega_n(\epsilon)$, with probability larger than $1 - (\Sigma_1 + \Sigma_2)\,e^{-z}$, we have for any
$m \in \mathcal{M}_n$



2 + e' 



\s-Sfnf+[n9n,p{l-e)-2{l + 9n,p)il + e)]x^ 



m 



< 



l + -{l + dn,p) 



+ n6, 



n,p 



En 



+ 



1 + 2e' + 2e' 



/2 



(l + a„,p)x'(m)+0„,p(^z + S), 



(18) 



where A 



/l-e' 



I + 3 

3 ^ l-^/T^ 



and 



Then, Lemma 7.5 allows us to take the expectation and get the following result. 



(^1 A^2)E 



< (V'3 V V'4) E 



■5 '5m II 

+ e„,p[yl(Si + S2) + S] 



(19) 



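where $\psi_1 = (\epsilon' - 2\theta_{n,p})(2 + \epsilon')^{-1}$, $\psi_2 = n\theta_{n,p}(1 - \epsilon') - 2(1 + \theta_{n,p})(1 + \epsilon')$,
$\psi_3 = 1 + \frac{2}{\epsilon'}(1 + \theta_{n,p})$, and $\psi_4 = n\theta_{n,p}\left[ 2 - \sqrt{1-\epsilon'} \right] + (1 + \theta_{n,p})\left[ 1 + 2\epsilon' + 2\epsilon'^2 \right]/\epsilon'$.

In order to obtain a meaningful inequality, a necessary requirement is $\psi_1, \psi_2, \psi_3, \psi_4 \ge 0$.
This is already satisfied for $\psi_3$ and $\psi_4$. We have only to check it for both $\psi_1$ and $\psi_2$.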
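It turns out that if $\epsilon' > 2/(n-1)$, then $p$ must satisfy

\[
  \frac{4\epsilon'}{1+3\epsilon'} + \frac{2}{n}\,\frac{1+\epsilon'}{1+3\epsilon'} \le \frac{p}{n} \le 1 - \frac{2}{\epsilon'(n-1) - 2},
\]

provided

\[
  \frac{4\epsilon'}{1+3\epsilon'} + \frac{2}{n}\,\frac{1+\epsilon'}{1+3\epsilon'} < 1 - \frac{2}{\epsilon'(n-1) - 2}, \tag{20}
\]

which is established by Lemma 7.6 for $n > 29$.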
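The two bounds on $p/n$ can be recovered by solving $\psi_1 \ge 0$ and $\psi_2 \ge 0$. The sketch below assumes $\theta_{n,p} = (2n-p)/((n-1)(n-p))$. This expression is not restated in this section; it is used here only because it reproduces both bounds exactly, and it should be checked against the definition given earlier in the paper.

```latex
% Assumption (to be checked against the paper's definition):
%   \theta_{n,p} = \frac{2n-p}{(n-1)(n-p)} = \frac{2-q}{(n-1)(1-q)}, \qquad q = p/n \in (0,1).
\begin{align*}
  \psi_1 \ge 0
    &\iff \theta_{n,p} \le \frac{\epsilon'}{2}
     \iff 1 + \frac{1}{1-q} \le \frac{\epsilon'(n-1)}{2}
     \iff q \le 1 - \frac{2}{\epsilon'(n-1)-2},\\
  \psi_2 \ge 0
    &\iff (2-q)\left[ n(1-\epsilon') - 2(1+\epsilon') \right] \ge 2(1+\epsilon')(n-1)(1-q)\\
    &\iff q\, n\,(1+3\epsilon') \ge 4n\epsilon' + 2(1+\epsilon')
     \iff q \ge \frac{4\epsilon'}{1+3\epsilon'} + \frac{2}{n}\,\frac{1+\epsilon'}{1+3\epsilon'} .
\end{align*}
% The last equivalence in the first line uses \epsilon'(n-1) - 2 > 0, i.e. \epsilon' > 2/(n-1).
```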

Remark 6. In (20), since $0 < \epsilon' < 1$ by definition, we have $4\epsilon'/(1+3\epsilon') < 1$.

Finally, to assert the existence of the constant $\Gamma$ in Theorem 4.1, the ratio
$(\psi_3 \vee \psi_4) / (\psi_1 \wedge \psi_2)$ has to be bounded. One can easily check that all the $\psi_k$'s can be
reshaped as

\[
  \frac{F(p, n)}{1 - p/n},
\]

where $F$ is a bounded quantity. Moreover by construction, the bounds in (20) lead to
$\psi_1 = 0$ and $\psi_2 = 0$, which should be prohibited since we would like to consider the ratio
$(\psi_3 \vee \psi_4) / (\psi_1 \wedge \psi_2)$. That is the reason why $p/n$ must be slightly larger (resp. lower)
than each one of the above bounds, hence (Ran). Furthermore since no bound depends on
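$s$, (Ran) gives the required constant $\Gamma$. A similar reasoning shows that there exists a constant
$K > 0$ depending on $s$ and the constants of the problem but independent from $n$, such that

\[
  \frac{\theta_{n,p}\left[ A(\Sigma_1 + \Sigma_2) + B \right]}{\psi_1 \wedge \psi_2} \le \frac{K}{n},
\]

which yields

\[
  \mathbb{E}\left[ \mathbf{1}_{\Omega_n(\epsilon)} \|s - \widehat{s}_{\widehat{m}}\|^2 \right]
  \le \Gamma(\epsilon, \alpha, \beta) \inf_{m \in \mathcal{M}_n} \mathbb{E}\left[ \|s - \widehat{s}_m\|^2 \right]
  + \frac{K(\epsilon, s, \alpha, \beta, \delta)}{n}.
\]

We now simply add the missing term $\mathbb{E}\left[ \mathbf{1}_{\Omega_n^c(\epsilon)} \|s - \widehat{s}_{\widehat{m}}\|^2 \right]$ to both sides of the above
inequality. It only remains to show that this term is of the right order: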



E 



< E 



[1. 



<Pf P[0^(e)]+E 



+ IE I \\Sm - Sfn\ 

Asm 



Lemma 7.3 then makes it possible to deduce that the first term on the right-hand side of this inequality
satisfies



Vn, \\s\\'¥[n'^ie)]<\\s 



|2 ^ 

n 

for an appropriate choice of $n_0 > 0$, depending on $\epsilon$, $\delta$ and $\Phi$.
For the second one, Jensen's inequality yields



E 



(^) 



Asm 



< E 



Asm 



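Moreover, (Squ) with any $\eta > 0$ provides

\[
  \left( \varphi_\lambda(X) - P\varphi_\lambda \right)^2 \le (1+\eta)\,\varphi_\lambda^2(X) + (1+\eta^{-1})\,(P\varphi_\lambda)^2.
\]

Finally, $\sum_{\lambda \in \Lambda(m)} \varphi_\lambda^2 = \phi_m$ and $P\phi_m \le \|\phi_m\|_\infty$ lead to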



E 



(^) 



Asm 



< (2 + r/ + r/"i) 



(log n) 



thanks to (Reg), and Lemma 7.3 allows us to conclude.



□ 
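For completeness, here is one way to obtain the inequality invoked through (Squ) in the last step of the proof above, together with the constant $2 + \eta + \eta^{-1}$. The elementary bound $2|ab| \le \eta a^2 + \eta^{-1} b^2$ is assumed to be what (Squ) encodes; the second line additionally uses $(P\varphi_\lambda)^2 \le P\varphi_\lambda^2$ (Jensen's inequality).

```latex
\begin{align*}
  (a-b)^2 &= a^2 - 2ab + b^2
    \le a^2 + \left( \eta a^2 + \eta^{-1} b^2 \right) + b^2
    = (1+\eta)\,a^2 + (1+\eta^{-1})\,b^2, \qquad \eta > 0,\\
  \sum_{\lambda\in\Lambda(m)} \mathbb{E}\left[ \left( \varphi_\lambda(X) - P\varphi_\lambda \right)^2 \right]
    &\le \sum_{\lambda\in\Lambda(m)} \left[ (1+\eta)\,P\varphi_\lambda^2 + (1+\eta^{-1})\,(P\varphi_\lambda)^2 \right]
    \le (2+\eta+\eta^{-1})\, P\phi_m ,
\end{align*}
% where \phi_m = \sum_{\lambda\in\Lambda(m)} \varphi_\lambda^2 and P\phi_m \le \|\phi_m\|_\infty.
```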
7.3.4 Proof of Proposition 7.1

Proof. Bernstein's inequality (see Massart, 2007) states



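\[
  \forall x > 0,\quad
  \mathbb{P}\left[ |\nu_n(\phi_m)| > \frac{\sqrt{2vx}}{n} + \frac{b}{3n}\, x \right] \le 2e^{-x},
\]

with $b \ge |\phi_m(X_i) - \mathbb{E}\phi_m(X_i)|$ and $v = \sum_{i=1}^n \mathrm{Var}\left[ \phi_m(X_i) \right]$.
Since the $X_i$ are i.i.d. and $\phi_m \ge 0$, we have

\[
  b = \|\phi_m\|_\infty \quad\text{and}\quad v \le nV_m \|\phi_m\|_\infty,
\]

hence the first part of the proposition.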
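The bound on $v$ is a one-line computation. A sketch, assuming $V_m = P\phi_m = \sum_{\lambda} \mathbb{E}[\varphi_\lambda^2(X)]$ as recalled in the proof of Theorem 4.1, and Bernstein's inequality in the deviation form $\sqrt{2vx}/n + bx/(3n)$; whether $v$ is defined through variances or second moments does not matter for the bound:

```latex
\begin{align*}
  \operatorname{Var}\left[ \phi_m(X_i) \right] \le \mathbb{E}\left[ \phi_m^2(X_i) \right]
    &\le \|\phi_m\|_\infty\, \mathbb{E}\left[ \phi_m(X_i) \right]
     = \|\phi_m\|_\infty V_m
     \quad\Longrightarrow\quad v \le n V_m \|\phi_m\|_\infty,\\
  \frac{\sqrt{2 v x}}{n} + \frac{b\,x}{3n}
    &\le \sqrt{\frac{2 \|\phi_m\|_\infty V_m\, x}{n}} + \frac{\|\phi_m\|_\infty\, x}{3n}.
\end{align*}
% The first line uses \phi_m \ge 0; taking x = y_m gives a threshold depending only on
% \|\phi_m\|_\infty, V_m, y_m and n.
```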

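For the second part of the result, the union bound combined with $y_m = z + C\,nE_m$ provides

\[
  \mathbb{P}\left[ \exists m \in \mathcal{M}_n \,:\, |\nu_n(\phi_m)| > \sqrt{\frac{2\|\phi_m\|_\infty V_m\, y_m}{n}} + \frac{\|\phi_m\|_\infty\, y_m}{3n} \right]
  \le \sum_{m \in \mathcal{M}_n} 2e^{-y_m}
  = 2e^{-z} \sum_{m \in \mathcal{M}_n} e^{-C\,nE_m}
  \le \Sigma_1 e^{-z},
\]

where the last inequality follows from (Ad) and (Pol).

□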



7.3.5 Proof of Lemma 7.3 

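Proof. We recall that

\[
  \Omega_n(\epsilon) = \left\{ \forall m \in \mathcal{M}_n,\ \forall \lambda \in \Lambda(m),\quad |\nu_n(\varphi_\lambda)| \le \frac{2\epsilon \|s\|_\infty \log n}{\kappa(\epsilon)\sqrt{\Phi n}} \right\}.
\]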



Then, we deduce that 



Of \ \ q\\ 1 00" 71 

3m€Mn, 3Ae A(m) | |z.„(<^a)| > " ^ 



''lloo'?('') 



D>1 



(logn)2 



(logn)2 



(Bernstein) 



(Pol) 



{D < n) 



where $\eta(t) = \dfrac{2t^2}{\kappa(t)\left( \kappa(t) + 2t/3 \right)}$.



□ 
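The function $\eta$ arises naturally from the exponential form of Bernstein's inequality. The computation below is only a sketch under assumptions that are not restated in this section: $\mathbb{E}[\varphi_\lambda^2(X)] \le \|s\|_\infty$ (orthonormal basis), $\|\varphi_\lambda\|_\infty \le \sqrt{\Phi n}/\log n$ (the kind of bound one would expect from (Reg) and (Reg2)), and the threshold $t = 2\epsilon \|s\|_\infty \log n / (\kappa(\epsilon)\sqrt{\Phi n})$ in the definition of $\Omega_n(\epsilon)$.

```latex
% Bernstein, exponential form (assumed):
%   P[ |\nu_n(\varphi_\lambda)| > t ] \le 2\exp\{ -n t^2 / ( 2 (v_0 + b t / 3) ) \},
% with v_0 = \|s\|_\infty \ge E[\varphi_\lambda^2(X)] and b = \sqrt{\Phi n}/\log n \ge \|\varphi_\lambda\|_\infty.
\begin{align*}
  n t^2 &= \frac{4\epsilon^2 \|s\|_\infty^2 (\log n)^2}{\kappa(\epsilon)^2\, \Phi},
  \qquad
  b\,t = \frac{\sqrt{\Phi n}}{\log n} \cdot \frac{2\epsilon \|s\|_\infty \log n}{\kappa(\epsilon)\sqrt{\Phi n}}
       = \frac{2\epsilon \|s\|_\infty}{\kappa(\epsilon)},\\
  2\left( v_0 + \frac{b\,t}{3} \right)
    &= 2\|s\|_\infty\, \frac{\kappa(\epsilon) + 2\epsilon/3}{\kappa(\epsilon)},\\
  \frac{n t^2}{2\left( v_0 + b t/3 \right)}
    &= \frac{2\epsilon^2}{\kappa(\epsilon)\left( \kappa(\epsilon) + 2\epsilon/3 \right)} \cdot \frac{\|s\|_\infty (\log n)^2}{\Phi}
     = \eta(\epsilon)\, \frac{\|s\|_\infty (\log n)^2}{\Phi}.
\end{align*}
```

Under these assumptions, a union bound over the models and the basis functions (controlled by (Pol) and $D \le n$) would supply the polynomial factor in front of the exponential.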
7.3.6 Proof of Proposition 7.2

Proof. First, we notice that $\chi(m) = \sqrt{\chi^2(m)}$ may also be expressed as



X(m) 



sup 

^AGA(m) "A" 



2^ OaV^a 
, A6A(m) 



> sup 

aeA 



E 

, A6A(m) 



ax^x 



where $\mathcal{A}$ is a dense subset of



A6A(m) 



1 and \ax\ < - 

AeA(m) 



Moreover, if we define the event 



Q = I sup Un (ifx) < t 
I AeA(m) 



for t > 0, then we deduce that 



x(m) < sup 



■'n ^ oa^^a 



, AeA(m) 



on 17 n {x("i) ^ -z}. 

Then, Talagrand's inequality applied to sup^g^ f„ ( SAeA(m) ^a<^a 



In sup 



E 

^AeA(m) 



ax^x 



> (1 + e) vVM + 



2||s| 



(21) 



gives for e, x > 0, 



< e" 



n 



with z = \/2 /n and t = 2e \\s\\^ [ K(e)$n/(log n)^ ] ^. 
Finally, the first result comes from both (21) and r2„(e) = 17. 

As for the second inequality, the choice Xm = C"^ Dm + z leads to 



3m E I V?^x("i)ln„(e) > (1 +e) \/ri^m + \/2 ||s 



<E2 e-^ 



C nE„ 



-C'CD+5 log D 



(Ad) and (Pol) 



D>1 



□ 
7.3.7 Proof of Lemma 7.6

Lemma 7.6. For $n > 29$, there exists $0 < \epsilon < 1$ such that

\[
  C(\epsilon) > \frac{2}{n-1}
  \quad\text{and}\quad
  \frac{4C(\epsilon)}{1+3C(\epsilon)} + \frac{2}{n} < 1 - \frac{2}{C(\epsilon)(n-1) - 2},
\]

where $C(\epsilon) = 1 - (1+\epsilon)^{-8}$.

Proof. The first part is obvious since for a given $n$, we can choose $0 < \epsilon < 1$ such that
$C(\epsilon) > 2/(n-1)$. Then with $\delta = C(\epsilon) - 2/(n-1)$, we have

\[
  \delta(n-1) = C(\epsilon)(n-1) - 2.
\]

After some calculations, it is easy to see that

4C(.) ^2^1 



l + 3C(e) n C(e)(f^-l)-2 

,2^ + 6 n - 10 2n + 10 

(5^ 5 \ ^ < 0, 

n n (n - ly 

which is a polynomial of degree 2 in $\delta$.

For $n > 29$, the discriminant is positive and any $\delta$ between the two distinct zeros yields a
value for $C(\epsilon)$ such that

\[
  \frac{4C(\epsilon)}{1+3C(\epsilon)} + \frac{2}{n} < 1 - \frac{2}{C(\epsilon)(n-1) - 2},
\]

which allows us to conclude. □

7.4 Theorem 4.2 

7.4.1 Intermediate results 

The proof of Theorem 4.2 follows the same structure as that of Theorem 4.1. Only the
main differences are reported here. These differences essentially occur in the control of the
$\chi^2$-type statistic. Since the arguments are nearly the same, the following results are given
without proof or with only short proofs.

Let us start by introducing another event of large probability on which we are able to 
get the desired control. For any $\epsilon > 0$,



where $\kappa(t) = 2(t^{-1} + 1/3)$.

The following lemma is the counterpart of Lemma 7.3 and is devoted to the control of the
remainder terms. It heavily relies on Bernstein's inequality.

Lemma 7.7. Set $\epsilon > 0$ and assume that (Reg), (Reg2), (Reg3), (Ad) and (Pol) hold.
Then,

Va > 0, F[n'^{e)] < 2n2+'5e^w(ll^ll^i)(i°s")', 
where $\eta(t) = \dfrac{2t^2}{\kappa(t)\left( \kappa(t) + 2t/3 \right)}$.

Now, we are in a position to give the main result providing the desired control on the $\chi^2$-type
statistic.

Proposition 7.3. Set $\epsilon > 0$ and for any $C, z > 0$, let $x_m = z + C\sqrt{nE_m}$. Assume that
(Reg), (Reg2), (Reg3), (Ad) and (Pol) are fulfilled. Then, for every $m \in \mathcal{M}_n$,

P V^x("i)ln4e) > (1 + e) {VnE^ + \l 2{\\s\\y l)^/<^ /iuE^x^ 1 < e" 
and furthermore, 

P 3m I V^xMln„(e) > (1 + e) + ^2(||s|| V l)^J^/inEZx^ <S 

where $\Sigma_2 > 0$ denotes a positive constant independent from $n$.

Proof (sketch). It relies on Talagrand's inequality as well as on the following
straightforward upper bound.

Vm, sup Var < ||s|| p||2^||<AmlL = \\4 \IUm\\^ < {\\s\\ V 1)0

teSm, ||t||2=l 

7.4.2 Outline of the proof of Theorem 4.2 

The first main difference comes from the use of Proposition 16, which yields 

\/^xMln„(e) > (1 + e) {y^^+ ^2{\\s\\yl)y/¥/^nE,nXr^ 
on an event of high probability. 

From several applications of (Squ) and (Roo), with $\rho, C > 0$, it follows that



□ 



2(iis|| vi)v¥7e^ 



<a/2(||s|| V 1) + J2{\\s\\ V l)^/¥/^C'nE, 



< p^/nE^ + p-\\\s\\\J l)^/¥/^z + C^/nE^, 

< {p + c) yG^ + p-\\\s\\vi)./¥/Cz, 
2

n 



with C" = C7 2{\\s\\yi)^¥/l 

Thus in the same way as (16), for every x > 0, we derive 

x'MlQ„(e) <(l + e)' [{l + x)E^[l + {p + C)f 

+ {p-\\\s\\yi)^) 

The remainder of the argument is essentially the same, which concludes the proof.